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Introduction 


Prostate  cancer  is  the  second  most  common  cause  of  cancer  deaths  in  men.  Diagnosis  and 
pathogenesis  of  this  disease  is  poorly  understood.  Prostate  specific  antigen  (PSA)  test  is  still 
the  standard  diagnostic  marker  for  prostate  cancer  despite  its  serious  limitations.  Large 
proportions  of  men  are  being  diagnosed  with  prostate  cancer  but  recent  studies  imply  that  many 
of  them  don’t  need  prostate  cancer  treatment.  There  is  clearly  a  need  for  better  diagnostic  and 
prognostic  marker  in  prostate  cancer. 

Recent  advances  in  DNA  microarray  technology  that  enable  the  simultaneous 
measurement  of  the  expression  of  thousands  of  genes  in  a  single  experiment  have 
revolutionized  current  molecular  biology.  Already,  the  21st  century  is  witnessing  an  explosion  in 
the  amount  of  biological  information  on  normal  and  disease  processes.  A  large  and 
exponentially  growing  volume  of  gene  expression  data  from  microarrays  is  now  available 
publicly.  In  addition  to  gene  expression  data,  massive  amounts  of  DNA  copy  number  data  is 
also  collected  through  CGH  microarrays.  Large  amounts  of  high  throughput  genomic  and 
epigenomic  data  have  been  collected  in  prostate  cancer.  Although  these  datasets  have  been 
analyzed  in  the  literature,  there  are  opportunities  for  mining  these  datasets  in  the  context  of  all 
other  publicly  available  data.  High  throughput  genomic  data  shows  the  promise  for  discovery  of 
better  diagnostic  and  prognostic  markers. 

Body 


Previously,  we  have  published  a  novel  approach  to  discover  Boolean  implications 
between  genes  using  these  large  number  of  gene  expression  datasets.  Subsequently,  we  used 
Boolean  implications  to  successfully  predict  genes  in  B  cell  developmental  pathway  (MiDReG 
algorithm) 3"6.  My  prostate  cancer  project  proposes  to  build  on  our  successful  prediction  of 
human  B  cell  developmental  genes  which  can  predict  pathways  based  on  human  gene 
expression  datasets.  In  this  report,  we  showed  that  Boolean  implication  predicts  different  state 
of  basal  cell  development  in  normal  prostate  tissue.  The  loss  of  basal  cell  expression  in  cancer 
is  correlated  with  the  recurrence-free  survival  of  the  prostate  cancer. 

Boolean  Implication  (BooleanNet) 

We  downloaded  25,237  microarrays  in  human  Affymetrix  U133  Plus  2.0  platform  from 
NCBI’s  GEO  (Gene  Expression  Omnibus)  database  \  and  normalized  using  RMA  (Robust 
Multi-chip  Average)  algorithm  2.  Within  these  datasets  (with  thousands  of  microarrays)  we 
identified  expression  relationships  between  pairs  of  genes  (represented  by  probe  sets  on  the 
arrays)  that  follow  simple  “if-then”  rules  such  as  “if  gene  X  is  high,  then  gene  Y  is  low,”  or  more 
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simply  stated:  “X  high  =>  Y  low”  (“X  high  implies  Y  low”).  In  this  case  gene  X  and  gene  Y  are 
rarely  “high”  together.  We  call  these  relationships  “Boolean  implications”.  There  are  only  six 
different  types  of  “Boolean  implications”  possible  in  these  datasets.  Figure  1  outlines  the  six 
different  types  of  Boolean  implications  discovered  among  the  probe  sets  within  the  human  data 
sets.  In  these  scatter  plots,  each  point  represents  gene  X’s  expression  versus  gene  Y’s 
expression  within  an  individual  microarray.  Each  plot  is  divided,  based  on  thresholds,  into  four 
quadrants:  (X  low,  Y  low),  (X  low,  Y  high),  (X  high,  Y  low),  and  (X  high,  Y  high). 


Figure  1.  Boolean  Implications.  Scatter  plots  of  25,237  Affymetrix  U133  Plus  2.0  human  microarrays 
downloaded  from  NCBI’s  Gene  Expression  Omnibus  and  normalized  together.  Each  probeset  is  assigned  a 
threshold  t  (red  lines).  Expression  levels  above  t  +  0.5  (blue  lines)  are  classified  as  “high,”  expression  levels 
below  t  -  0.5  (blue  lines)  are  classified  as  “low,”  and  values  between  t  -  0.5  and  t  +  0.5  are  classified  as 
“intermediate.”  The  plots  show  the  six  different  types  of  Boolean  implication  relationships  between  a  pair  of 
genes.  Boolean  implication  is  discovered  by  identifying  a  sparse  quadrant  in  the  scatter  plot. 


A  Boolean  implication  exists  when  one  or  more  quadrants  is  sparsely  populated 
according  to  a  statistical  test  and  there  are  enough  high  and  low  values  for  each  gene  (to 
prevent  the  discovery  of  implications  that  follow  from  an  extreme  skew  in  the  distribution  of  one 
of  the  genes) 3.  There  are  four  asymmetric  Boolean  implications,  each  corresponding  to  one 
sparse  quadrant.  Two  symmetric  Boolean  implications  “equivalent”  and  “opposite”  are 
discovered  when  two  diagonally  opposite  sparse  quadrants  are  identified.  Boolean  implications 
can  also  be  extended  to  logical  combinations  of  genes.  For  example  the  Boolean  implication  “A 
=>  B”  can  be  discovered  where  A  and  B  are  either  single  gene  conditions  (e.g.,  X  high)  or  logical 
combinations  of  multiple  genes  (e.g.,  X  high  AND  Y  high). 
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MiDReG  algorithm 


We  developed  a  new  method  termed  Mining  Developmental^  Regulated  Genes  (MiDReG)  to 
predict  genes  whose  expression  is  either  activated  or  repressed  as  precursor  cells  differentiate 
4'5.  MiDReG  bases  its  predictions  on  Boolean  implications  mined  from  large-scale  microarray 
databases  and  requires  two  or  more  “end  point”  markers  for  a  given  developmental  pathway. 


Figure  2.  MiDReG  algorithm. 

Genes  in  B  cell  developmental  pathway  are  discovered 
by  using  a  Boolean  interpolation  between  two  known 
genes  KIT  and  CD  19  that  marks  the  endpoints.  KIT  is 
expressed  early  in  B  cell  development  and  CD  19  is 
expressed  late.  There  is  a  robust  Boolean  implication 
KIT  high  =^>  CD  19  low  is  observed  in  the  diverse 
collection  of  microarray  dataset  both  in  humans  and 
mice.  Genes  that  are  expressed  at  an  intermediate  step 
and  remain  high  till  the  end  are  discovered  by  identifying 
genes  with  KIT  high  =^>  X  low  and  CD  19  high  =^>  X  high 
Boolean  implications. 


Conservation  across  humans  and  mice 


19  predicted  gt'iiCB  using  KIT  mid  CDlS 
'A  v  -  I ' :  I '  Ik  Mill' I  /Oil, IV  l  SKIT*  IE  tC'lfJ 

TBClDl  PTPRCAP  NUN  33  ITGB7  01)53 

j(C3H12D  AHHGAl’J  LY9  ITGAJ 

CEXTB1  LAT2  SFPT1  ILL2RH1 


For  example,  in  studies  of  B  cell  development,  we  used  two  known  genes  KIT  and  CD19  that 
are  expressed  early  and  late  respectively  during  B  cell  development  (Figure  2).  A  conserved 
Boolean  implication  KIT  high  =>  CD19  low  is  observed  in  the  microarray  dataset.  MiDReG 
searched  for  genes  X  that  are  expressed  during  development  and  satisfy  the  implications  “KIT 
high  =>  X  low”  and  “CD19  high  =>  X  high”  (Figure  2),  which  represents  the  pattern  of  expression 
we  expect  for  genes  that  are  not  expressed  early  in  development  when  KIT  is  highly  expressed 
(KIT  high  =>  X  low),  then  upregulated  later  in  development  when  CD19  is  also  upregulated 
(CD19  high  =>  X  high).  The  predicted  genes  were  successfully  validated  in  collaboration  with 
the  Weissman  lab  at  Stanford  University. 

Novel  prostate  cancer  pathway  modeling  using  Boolean  implication 

We  focused  on  modeling  a  differentiation  pathway  in  human  prostate  cancer  tissue  using 
Boolean  implication.  This  approach  was  motivated  by  our  previously  published  MiDReG 
algorithm  that  predicts  developmentally  regulated  genes  using  Boolean  implication  3’4.  We  first 
collected  publicly  available  gene  expression  datasets  from  human  prostate  cancer  samples 
(Supplementary  Figure  1).  To  analyze  the  datasets  using  BooleanNet  algorithm,  we  also 
downloaded  25,237  Affymetrix  U133  Plus  2.0  datasets. 

In  most  human  epithelial  tissues  both  Keratin  5  (K5)  and  Keratin  14  (K14)  are  expressed 
in  the  basal  cell  compartments.  We  analyzed  gene  expression  values  of  K14  and  K5  that  is 
presented  in  the  form  of  a  scatterplot  with  25,237  points  representing  diverse  microarrays  on 
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human  samples  including  different  normal  and  cancer  tissues  (Supplementary  Figure  2).  We 
summarize  the  gene  expression  relationship  between  K14  and  K5  as  “if  K14  high  then  K5  high” 
or  alternatively  a  Boolean  implication  relationship  “K14  high  =>  K5  high”.  The  relationship  clearly 
suggests  that  K14+  arrays  are  a  subset  of  K5+  arrays.  Since  not  all  cells  within  a  sample 
express  K14  and  K5,  we  could  hypothesize  that  K14+  cells  are  a  subset  of  K5+  cells 
(Supplementary  Figure  2A)  based  on  the  Boolean  implication.  Panel  A  shows  a  likely  model  of 
developmental  gene  regulation  between  K14  and  K5,  where  K14  is  upstream  of  K5 
(Supplementary  Figure  2). 

To  evaluate  whether  Keratin  gene  expression  is  associated  with  patient  outcome,  we 
investigated  the  status  of  three  Keratin  expression  groups  (KRT14+KRT5+,  KRT14-KRT5+, 
KRT14-KRT5-)  on  recurrence-free  survival  (RFS)  in  three  independent  prostate  cancer  cohorts 
(Singh  2002  dataset,  n=1 02;  Glinsky  2004  dataset,  n=78;  Taylor  2010  dataset,  n=185),  The 
results  confirmed  that  KRT14-KRT5-  tumors  were  associated  with  worse  clinical  outcomes  (B). 

In  addition,  KRT14+KRT5+  tumors  were  associated  with  best  clinical  and  KRT14-KRT5+  tumors 
were  associated  with  intermediate  clinical  outcome. 

Training  tasks 

The  statement  of  work  includes  several  tasks  on  career  development.  I  attended  the  2010 
Scientific  Management  Series  from  the  office  of  postdoctoral  affairs.  The  goals  and  objectives  of 
the  Scientific  Management  Series  are  to  provide  participants  with  laboratory  or  research 
management  skills  that  will  help  them  to  launch  productive  independent  careers  in  academic 
and  other  settings.  All  coursework  was  completed  including  Stats  141  in  the  Fall  2012  and  the 
Systems  Biology  in  spring  201 1 .  I  have  been  meeting  with  Professor  Joe  Lipsick  weekly  and 
Professor  Jonathan  Pollack  biweekly.  These  meetings  are  extremely  useful  for  my  career 
development  as  I  get  a  lot  of  advice  on  both  research  as  well  as  career  from  both  of  my 
mentors.  I  have  been  attending  the  weekly  seminar  on  Molecular  Profiling  Colloquium.  I  attend 
urology  seminars  regularly  every  Monday.  I  have  been  developing  biological  skills  at  the  Lipsick 
lab  and  the  Pollack  lab.  I  have  already  acquired  several  biological  skills  such  as  PCR, 
Immunostaining  to  perform  my  own  biological  experiments.  Overall,  my  training  on  cancer 
biology  has  been  very  extensive.  I  have  already  performed  immunostaining  on  human  tissues 
myself. 
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Key  Research  Accomplishments 


1.  Collection  of  high-throughput  genomic  and  epigenomic  data. 

As  mentioned  in  my  statement  of  work,  I  was  planning  to  collect  publicly  available  gene 
expression  datasets  for  Boolean  implication  analysis.  During  last  two  years  I  have 
collected  several  publicly  available  gene  expression  datasets  from  National  Center  for 
Biotechnology  Information’s  (NCBI)  Gene  Expression  Omnibus  (GEO)  webpage  and  the 
Array  Express  webpage  \  This  collection  not  only  includes  prostate  cancer  but  also 
includes  gene  expression  data  on  other  human  cancer  and  normal  tissues.  Since  all 
data  available  on  a  particular  Affymetrix  platform  can  be  normalized  together,  a  large 
database  of  gene  expression  data  can  be  built  that  can  be  analyzed  simultaneously.  My 
largest  database  includes  45,000  Affymetrix  microarrays.  In  addition  to  this,  I  have 
collected  23  different  prostate  cancer  datasets  in  different  platforms  (Supplementary 
Figure  1).  For  my  research  work,  it  is  important  to  have  gene  expression  data  annotated 
with  clinical  information  such  as  Survival.  Among  the  23  different  prostate  cancer 
datasets,  Survival  data  was  available  for  only  five  different  datasets  (Glinsky  et  al., 
Lapointe  et  al.,  Gulzar  et  al.,  Sboner  et  al.,  Taylor  et  al.;  Supplementary  Figure  1) 10"14. 
Since  these  datasets  are  in  different  platform,  they  cannot  be  combined  together.  To 
build  a  large  prostate  cancer  specific  database,  I  combined  14  different  datasets  (global 
prostate  cancer  database,  total  n=891)  that  are  in  Affymetrix  U133A  (n=456),  U133A  2.0 
(n=72),  or  U133  Plus  2.0  (n=363),  selected  common  probesets  for  normalization.  All 
these  891  samples  were  normalized  together  using  standard  RMA  algorithm.  Following 
are  the  summaries  of  the  accomplishments. 

a.  Collected  45,000  Affymetrix  microarrays  from  NCBI’s  GEO 

b.  Collected  23  different  prostate  cancer  datasets 

c.  Annotate  five  prostate  cancer  datasets  with  Survival  data 

d.  Combine  14  prostate  cancer  database  to  build  a  global  prostate  cancer  database 
(n=891). 

2.  Analysis  of  the  datasets. 

I  have  performed  all  required  analysis  on  the  collected  datasets  using  my  previously 
published  algorithms.  Following  are  the  summaries  of  the  accomplishments. 

a.  Built  a  complete  Boolean  implication  network  with  45,000  Affymetrix  microarrays. 

I  used  my  previously  published  Boolean  Net  algorithm  (Sahoo  et  al.  Genome  Biology. 
2008 3)  on  the  newly  collected  dataset  of  45,000  Affymetrix  microarrays. 

b.  Identified  developmental  genes  using  MiDReG  approach. 
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I  used  an  approach  similar  to  MiDReG  (Mining  Developmental ly  Regulated  Genes 
4,5J  to  identify  developmental  genes  in  prostate  tissue  (Supplementary  Figure  2). 
Human  prostate  cancer  is  typically  characterized  by  luminal  cell  expansion  and  the 
absence  of  basal  cells.  In  normal  prostate  tissue  basal  cell  express  Keratin  5  (KRT5) 
and  Keratin  14  (KRT14).  There  is  a  significant  Boolean  implication  between  KRT5 
and  KRT14:  "KRT14  high  =>  KRT5  high".  In  other  words,  KRT14+  cells  are  a  subset 
of  KRT5+  cells.  Assuming  that  basal  cells  differentiates  to  a  luminal  cell,  luminal  cells 
are  predominantly  KRT14-  cells,  and  KRT14  expression  change  once  during  the 
development,  we  predict  that  KRT14+KRT5+  cells  are  upstream  of  KRT14-KRT5+ 
cells,  which  could  be  upstream  of  KRT1 4- KRT 5-  luminal  cells  in  normal  prostate 
tissue. 

c.  Identified  correlation  between  developmental  genes  and  clinical  outcome. 

I  identified  three  groups  of  patients  in  three  independent  microarray  datasets  KRT14- 
KRT5-,  KRT14-KRT5+,  and  KRT14+KRT5+  (Supplementary  Figure  3).  Recurrence 
free  survival  analysis  of  these  three  independent  datasets  revealed  that  KRT14- 
KRT5-  patients  have  the  worst,  KRT14+KRT5+  patients  have  the  best,  and  KRT14- 
KRT5+  patients  have  intermediate  clinical  outcome  (Supplementary  Figure  3).  This 
result  correlates  well  with  the  systematic  loss  of  basal  cells  in  prostate  cancer. 

3.  Verify  the  results 

My  validation  experiment  was  performed  directly  on  human  prostate  tissues  instead  of 
human  cell  culture.  We  have  performed  KRT14  and  KRT5  immunohistochemistry  on  218 
human  prostate  tissues  using  a  tissue  microarray.  We  discovered  only  2  KRT5  positive 
human  prostate  cancer  tissues  and  all  of  them  were  KRT14  negative.  This  is  consistent 
with  our  hypothesis  of  systematic  loss  of  basal  cells  in  prostate  cancer.  Basal  cells  in 
human  prostate  tissue  express  KRT14  and  KRT5  and  we  do  not  see  their  expression  in 
human  prostate  cancer.  Therefore,  we  believe  that  the  correlation  of  KRT14  and  KRT5 
gene  expression  to  recurrence-free  survival  must  be  coming  from  the  surrounding 
normal  human  prostate  tissues  in  prostate  cancer.  This  is  an  important  finding  that  can 
reveal  the  underlying  biology  of  human  prostate  cancer. 

Reportable  Outcomes 

1.  Abstract  presentation  in  International  Society  for  Stem  Cell  Research  (ISSCR)  10th 
Annual  Meeting,  Jun  13-16,  2012,  Yokohama,  Japan.  (Appendix  A) 

2.  Informatics  databases: 

a.  Prostate  cancer  database  (Supplementary  Figure  1) 

b.  Global  Affymetrix  gene  expression  database  (Supplementary  Figure  2) 

c.  Bladder  cancer  database  (Published  in  PNAS,  Appendix  B) 
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d.  Colon  cancer  database  (Submitted  to  NEJM) 

e.  Breast  cancer  database  (Working  draft) 

f.  Ovarian  cancer  database  (Working  draft) 

g.  Brain  cancer  database  (Working  draft) 

3.  Manuscript  published: 

a.  [Appendix  B  7]  Debashis  Sahoo*,  Jens-Peter  Volkmer*,  Robert  Chin*,  Philip 
Levy  Ho,  Chad  Tang,  Antonina  V.  Kurtova,  Stephen  B.  Willingham,  Senthil  K. 
Pazhanisamy,  Humberto  Contreras-Trujillo,  Theresa  A.  Storm,  Yair  Lotan, 
Andrew  H.  Beck,  Benjamin  Chung,  Ash  A.  Alizadeh,  Guilherme  Godoy,  Seth  P. 
Lerner,  Matt  van  de  Rijn,  Linda.  D.  Shortliffe,  Irving  L.  Weissman,  and  Keith  S. 
Chan.  Three  differentiation  states  risk-stratify  bladder  cancer  into  distinct 
subtypes.  PNAS,  2012  Feb  7;109(6):2078-83. 

b.  [Appendix  C  8]  Debashis  Sahoo*,  Piero  Dalerba*,  Tomer  Kalisky*,  Pradeep  S. 
Rajendran,  Mike  Rothenberg,  Anne  A.  Leyrat,  Sopheak  Sim,  Jennifer  Okamoto, 
John  D.  Johnston,  Dalong  Qian,  Maider  Zabala,  Janet  Bueno,  Norma  Neff, 
Jianbin  Wang,  Andy  A.  Shelton,  Brendan  Visser,  Shigeo  Hisamori,  Mark  van  den 
Wetering,  Hans  Clevers,  Michael  F.  Clarke*  and  Stephen  R.  Quake*.  High 
throughput  single-cell  analysis  of  colon  tumors:  biological  insights  and  clinical 
applications.  Nat  Biotechnol.  2011  Nov  13;29(12):1 120-7. 

c.  [Appendix  D  9]  Debashis  Sahoo.  The  power  of  Boolean  implication  networks. 
Front.  Physio.  23  July  2012,  3:276.  doi:10.3389/fphys. 201 2.00276  (mini  review) 

4.  Received  NIH  pathway  to  independence  award  (K99/R00)  award  (Appendix  E). 

5.  Manuscript  submitted: 

•  Debashis  Sahoo*,  Piero  Dalerba*,  Pradeep  S.  Rajendran,  Stephen  P.  Miranda, 
Shigeo  Hisamori,  and  Michael  F.  Clarke.  Gene/Protein  expression  predicts 
survival  in  human  colon  cancer.  NEJM  (Under  Review). 

6.  Manuscript  in  preparation: 

•  Debashis  Sahoo*,  Jonathan  R.  Pollack,  Joseph  Lipsick,  and  James  D.  Brooks. 
Gene/Protein  expression  predicts  survival  in  human  prostate  cancer. 


Conclusion 

We  showed  that  Boolean  implication  predicts  different  state  of  basal  cell  development  in  normal 
prostate  tissue.  The  loss  of  basal  cell  expression  in  cancer  is  correlated  with  the  recurrence-free 
survival  of  the  prostate  cancer. 
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NORMAL  AND  MALIGNANT  HUMAN  TISSUES. 

Sahoo,  Debashis1,  Dalerba,  Piero2,  Volkmer,  Jens-Peter3,  Chin,  Robert  K.4,  Tang,  Chad3,  Willingham, 
Stephen  B.3,  Chan,  Keith  S.3,  van  de  Rijn,  Matt1,  Shortliffe,  Linda  D.5,  Clarke,  Mike  F.3,  Lipsick,  Joseph1, 

Weissman,  Irving  L.1 
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Many,  if  not  all  organs  and  tissues  consist  of  self-renewing  stem  cells  that  give  rise  to  distinct,  sequential 
progenitors  with  increasingly  limited  development  potential,  ultimately  producing  functional  mature  cells. 
All  malignancies  develop  from  cells  within  such  hierarchies,  requiring  progression  of  events  resulting  in 
tumor  cells  that  are  capable  of  self-renewal,  survival,  migration,  and  likely  also  differentiation.  The 
identification  and  characterization  of  stem,  progenitor,  and  mature  cells  within  normal  and  diseased 
tissue  are  not  only  critical  for  the  understanding  of  underlying  biology  but  also  in  developing  more 
effective  therapeutic  strategies.  Previous  attempts  to  identify  markers  for  cells  at  hierarchical  stages  of 
tissue  differentiation  involved  either  1)  large  screening  studies  using  antibody  libraries  or  gene 
expression  arrays,  or  2)  focused  trials  of  established  markers  identified  in  other  normal  and  diseased 
tissues.  Unfortunately,  this  "random"  approach  is  insufficient  to  trace  complex  cellular  differentiation 
stages,  and  thus  most  often  fails.  Therefore  a  systematic  approach  to  identify  cells  within  tissue 
differentiation  hierarchies  is  required.  We  applied  systematic  computational  approaches  to  identify 
markers  of  stem  and  progenitor  cells  by  analyzing  publicly  available,  high-throughput  gene  expression 
datasets  consisting  of  more  than  2  billion  measurement  points,  and  subsequently  to  validate  them  using 
tissue  microarrays.  We  used  a  new  method  called  MiDReG  (Mining  Developmentally  Regulated  Genes) 
that  uses  Boolean  implications  to  successfully  predict  genes  in  developmental  pathways.  We  developed  a 
new  software  tool  called  HEGEMON  (Hierarchical  Exploration  of  Gene  Expression  Microarray  Online)  to 
identify  genes  expressed  in  the  stem  and  progenitor  cells  in  malignant  tissue  development.  HEGEMON 
explores  gene  expression  data  with  its  clinical  information  using  a  scatterplot  of  gene  expression  values 
from  two  genes  and  provides  a  simple  framework  for  automatic  selection  of  genes  correlated  with 
distinct  patient  information,  e.g.  progression  and  survival.  Using  the  above  tools  we  demonstrate  a  new 
concept  that  human  cancers  can  be  used  as  a  platform  to  study  normal  developmental  steps  of  the 
human  tissues.  We  use  examples  of  human  bladder  and  colon  cancer  to  show  the  power  of  this 
computational  approach. 
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Current  clinical  judgment  in  bladder  cancer  (BC)  relies  primarily  on 
pathological  stage  and  grade.  We  investigated  whether  a  molecular 
classification  of  tumor  cell  differentiation,  based  on  a  developmental 
biology  approach,  can  provide  additional  prognostic  information. 
Exploiting  large  preexisting  gene-expression  databases,  we  devel¬ 
oped  a  biologically  supervised  computational  model  to  predict 
markers  that  correspond  with  BC  differentiation.  To  provide  mech¬ 
anistic  insight,  we  assessed  relative  tumorigenicity  and  differentia¬ 
tion  potential  via  xenotransplantation.  We  then  correlated  the 
prognostic  utility  of  the  identified  markers  to  outcomes  within  gene 
expression  and  formalin-fixed  paraffin-embedded  (FFPE)  tissue 
datasets.  Our  data  indicate  that  BC  can  be  subclassified  into  three 
subtypes,  on  the  basis  of  their  differentiation  states:  basal,  interme¬ 
diate,  and  differentiated,  where  only  the  most  primitive  tumor  cell 
subpopulation  within  each  subtype  is  capable  of  generating  xeno¬ 
graft  tumors  and  recapitulating  downstream  populations.  We  found 
that  keratin  14  (KRT14)  marks  the  most  primitive  differentiation 
state  that  precedes  KRT5  and  KRT20  expression.  Furthermore, 
KRT14  expression  is  consistently  associated  with  worse  prognosis 
in  both  univariate  and  multivariate  analyses.  We  identify  here  three 
distinct  BC  subtypes  on  the  basis  of  their  differentiation  states,  each 
harboring  a  unique  tumor-initiating  population. 

Boolean  analysis  |  stem  and  progenitor  cells  |  biomarker  | 
cancer  stem  cell  |  systems  biology 

Bladder  cancer  (BC)  is  the  sixth  most  common  malignancy  in 
the  United  States  (1),  accounting  for  ~69,250  new  cases  and 
14,990  deaths  in  2010  (2).  The  vast  majority  (90%)  of  BCs  are 
histologically  classified  as  urothelial  carcinomas  (UCs)  (3).  UCs 
originate  from  the  bladder  urothelium,  an  epithelial  tissue  with 
a  clear  hierarchical  organization  consisting  of  three  morphologi¬ 
cally  distinct  cell  types:  basal,  intermediate,  and  umbrella  cells  (4), 
representing  early,  mid,  and  later  differentiation  states,  respec¬ 
tively.  Malignant  transformation  can  occur  in  any  of  these  cell 
types  thus  giving  rise  to  tumors  with  diverse  phenotypes  (5). 

Currently,  the  World  Health  Organization  (WHO)  BC  classi¬ 
fication  scheme  relies  primarily  on  pathologic  stage  and  histolog¬ 
ical  grade  for  prognostic  classification.  Identification  of  new 
molecular  markers  would  allow  for  improved  risk  stratification  so 
that  we  may  better  use  risk-adapted  therapies.  Recent  molecular 
profiling  of  unfractionated  BCs  has  identified  unique  prognostic 
gene  signatures  (6-17).  However,  these  gene  signatures  have  not 
been  clinically  used  and  their  biological  relevance  has  remained  to 
be  elucidated.  Here,  we  developed  a  biologically  supervised 
computational  approach  to  mine  the  extensive  repertoire  of  pub¬ 
licly  available  gene  expression  array  data  to  define  molecular 
markers  of  cellular  differentiation  consistent  across  the  range 
of  mammalian  cellular  diversification  (18).  This  algorithm  uses 
Boolean  logic  to  evaluate  large  datasets  to  identify  genes  that 


sequentially  change  expression  during  differentiation  (e.g.,  pro¬ 
genitor  genes  that  decrease  during  differentiation  with  the  con¬ 
comitant  up-regulation  of  differentiation  genes).  In  the  current 
study,  we  have  successfully  predicted  and  functionally  validated 
molecular  markers  for  multiple  differentiation  steps  in  BC  and 
analyzed  their  association  with  patient  survival. 

Results 

In  the  presented  study  we  focus  on  UCs,  hereafter  synonymously 
called  BC,  and  excluded  other  BC  subtypes  (squamous  and  ade¬ 
nocarcinomas)  from  gene-expression,  phenotypical,  functional, 
and  patient  survival  analyses. 

Overall  Strategy  to  Predict,  Functionally  Validate,  and  Associate  Differ¬ 
entiation  States  to  Survival  in  BC.  A  biologically  supervised  approach 
was  used  to  predict  markers  of  differentiation  states  in  BC  (Fig. 
1).  The  expression  patterns  of  our  two  previously  published  hi¬ 
erarchically  related  differentiation  markers  in  BC,  keratin  (KRT) 
5  and  KRT20  (19),  were  analyzed  by  the  algorithm  “mining  de- 
velopmentally  regulated  genes”  (MiDReG),  which  revealed  a 
third  differentiation  marker,  KRT14.  We  therefore  hypothesized 
the  existence  of  three  distinct  differentiation  states  marked  by 
KRT14,  -5,  and  -20,  which  are  shared  by  both  normal  urothelium 
and  BC.  We  then  used  the  algorithm  “hierarchical  exploration  of 
gene-expression  microarrays  online”  (Hegemon)  to  identify  cell 
surface  markers  corresponding  to  each  differentiation  state. 
FACS  separation  with  these  surface  marker  combinations 
allowed  for  the  isolation  of  the  respective  tumor-initiating  cell  (T- 
IC)  populations  from  clinical  samples  and  analysis  of  their  re¬ 
spective  tumorigenic  and  differentiation  potential  in  xenotrans¬ 
plantation  models.  We  then  analyzed  the  prognostic  utility  of 
these  differentiation  markers  using  patient  gene-expression  arrays 
(492  patients)  and  formalin-fixed  paraffin-embedded  (FFPE) 
(275  patients)  tissue  sets. 
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Fig.  1-  Flowchart  of  identification  and  validation  of  differentiation  states  in 
BC.  Markers  of  differentiation  (keratins)  are  identified  by  using  "mining 
developmentally  regulated  genes"  (MiDReG)  and  corresponding  cell  surface 
markers  are  identified  by  using  "hierarchical  exploration  of  gene-expression 
microarray  online"  (Hegemon).  The  hypothetical  hierarchy  of  differentiation 
is  evaluated  in  patient  tumor  cell  xenotransplantation  mouse  models.  As¬ 
sociation  of  differentiation  states  in  bladder  cancer  with  patient  outcome  is 
validated  with  existing  databases  and  archival  tissues. 


Keratin  14  Is  Predicted  to  Precede  Keratin  5  and  -20  in  Urothelial 
Differentiation.  KRTs  are  differentially  expressed  during  epithelial 
tissue  differentiation,  a  phenotype  that  is  often  conserved  in 
neoplastic  transformation  (20,  21).  During  normal  urothelial 
differentiation,  it  is  proposed  that  basal  and  intermediate  cells 
express  KRT5,  but  not  KRT20.  Conversely,  terminal  differenti¬ 
ation  involves  the  loss  of  KRT5  and  gain  of  KRT20  expression 
(22,  23)  (Fig.  24).  Immunofluorescence  analysis  confirmed  our 
previous  finding  that  coexpression  of  CD44  and  KRT5  define 
basal/progenitor  cells  in  BC,  whereas  terminally  differentiated 
tumor  cells  express  KRT20  but  not  CD44  and  KRT5  (19) 
(Fig.  SL4). 


We  developed  a  biologically  supervised  computational  ap¬ 
proach,  which  mines  the  repertoire  of  publicly  available  micro¬ 
array  data  to  identify  genes  that  are  down-regulated  during  cellular 
differentiation  (18).  Starting  with  the  knowledge  that  KRT5  and 
KRT20  expression  is  limited  to  progenitor  and  downstream  pop¬ 
ulations,  respectively  (Fig.  24),  we  applied  MiDReG  to  predict 
upstream  keratins  (KX)  that  satisfy  two  Boolean  relationships  (i) 
when  KX  expression  is  high,  expression  of  progenitor  KRT5  is 
high  (Fig.  2 B,  red/blue),  and  (ii)  when  KX  expression  is  high,  ex¬ 
pression  of  terminal  differentiation  marker  KRT20  is  low  (Fig.  2 B, 
red/green)  (details  described  in  SI  Methods)  (18,  24,  25).  Using 
AffyBC  and  Chungbuk  datasets,  we  identified  four  keratins 
(KRT14,  KRT16,  KRT6A,  and  KRT6B;  Fig.  SLF,  details  in  SI 
Methods)  that  fulfilled  these  criteria  (Fig.  2  C  and  D).  Analysis  of 
the  Chungbuk  dataset  revealed  two  keratins  significantly  associ¬ 
ated  with  outcome:  KRT14  (hazard  ratio  (HR)  2.75,  P  <  0.05),  and 
KRT6B  (HR  3.48,  P  <  0.05)  (Fig.  SLF).  We  further  focused  on 
KRT14,  as  this  marker  was  more  highly  and  consistently  expressed 
within  the  AffyBC  and  Chungbuk  datasets.  Immunofluoresence 
analysis  confirmed  KRT14  expression  (Fig.  2E,  green  cells)  marks 
a  subpopulation  of  KRT5+  cells  in  BC  (Fig.  2 E,  red  cells)  (double 
positive  cells,  yellow,  are  indicated  by  white  arrows).  Analogous  to 
BC,  KRT14  staining  on  normal  bladder  tissue  shows  a  basal-cell- 
restricted  expression  pattern  (Fig.  S2  D  andFfy  On  the  basis  of  the 
MiDReG  analysis  (Fig.  S2  A-C).  we  predicted  the  existence  of 
three  differentiation  states  in  urothelial  cells:  basal  (KRT14+ 
KRT5+KRT20“),  intermediate  (KRT14“KRT5+KRT2(T),  and 
differentiated  (KRT14“KRT5“KRT20+)  (Fig.  IF). 

Identification  of  Corresponding  Surface  Markers  to  the  Predicted 
Keratin  Differentiation  States  in  BC.  We  identified  surface  markers 
specific  for  each  of  the  three  BC  differentiation  states  to  allow  for 
prospective  isolation  by  FACS  and  in  vivo  functional  validation 
via  a  xenotransplantation  model.  To  perform  this  analysis,  we 
developed  a  software  program  named  Hegemon  (SI  Methods) 
to  identify  surface  markers  highly  expressed  in  the  basal  cells 
(KRT14+)  and  progressively  down-regulated  in  intermediate 
(KRT5+)  and  differentiated  cells  (KRT20+)  (Fig.  3 A  and  Fig. 
S3F).  Using  Hegemon,  we  ranked  each  marker  on  the  basis  of 


Fig.  2.  Keratin  14,  -5,  and  -20  define  three  differentiation  states  in  BC.  Keratins  are  abbreviated  as  KX.  (A)  K5  is  expressed  early  during  differentiation  (blue) 
and  its  expression  is  temporally  exclusive  with  that  of  the  terminal  differentiation  marker  K20  (green)  in  bladder  cancer  (BC).  The  mutual  relationship  of  K5 
and  K20  in  their  temporal  expression  is  consistent  across  diverse  tissues  (totaling  75,000  data  points)  in  multiple  species  (human,  mouse,  and  rat;  Fig.  SI  B-E). 
(B)  Schematic  illustrating  the  principle  behind  the  computational  strategy  MiDReG  used  to  predict  a  keratin  X  (KX,  red),  which  is  precursor  to  K5  and  K20  by 
fulfilling  two  Boolean  relationships:  (/)  when  KX  expression  is  high  (red),  expression  of  the  early  progenitor  marker  K5  is  also  high  (blue)  and  (/'/')  when  KX 
expression  is  high  (red),  expression  of  the  differentiation  marker  K20  is  low  (green).  (C)  K1 4  fulfills  the  first  Boolean  relationship,  its  expression  is  high  (red) 
when  the  expression  of  early  progenitor  marker  K5  is  also  high  (blue).  (D)  K14  fulfills  the  second  Boolean  relationship,  its  expression  is  low  (red)  when  the 
expression  of  terminal  differentiation  marker  K20  is  high  (green).  ( E )  KRT1 4-expressing  cells  (Alexa  488/green)  mark  a  subpopulation  of  KRT5+  cells  (Alexa 
594/red)  in  BC;  white  arrows  indicate  double  positive  cells,  yellow.  ( F )  Schematic  illustration  of  the  three  predicted  differentiation  states  in  urothelial  biology. 
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association  with  patient  survival  (via  hazard  ratios)  (Dataset  SI) 
and  identified  CD248,  S100A8,  COL1A1,  and  CD90  (THY1)  as 
the  top  candidate  markers  (Fig.  3 A  and  Fig.  S3F  and  Dataset  SI). 
We  focused  on  CD90,  because  a  flow-cytometry-compatible  an¬ 
tibody  was  commercially  available.  As  expected,  our  previously 
identified  marker,  CD44,  was  also  demonstrated  to  exhibit 
a  predominant  basal  (KRT14+)  distribution  (Fig.  3 A  and  Fig.  S3F 
and  Dataset  SI).  We  next  used  Hegemon  to  identify  those  sur¬ 
face  markers  that  are  expressed  in  all  cells  but  down-regulated  in 
the  transition  from  basal  to  differentiated  cells  (Fig.  3 B  and  Fig. 
S3G  and  Dataset  SI).  From  this  group,  we  focused  on  CD49f 
(ITGA6),  as  this  marker  has  been  reported  to  be  coexpressed 
with  KRT14  and  is  down-regulated  during  differentiation  in 
various  normal  epithelial  tissues  and  cancer  types  (26,  27). 

Next,  we  used  flow  cytometry  to  examine  whether  a  combination 
of  these  newly  identified  markers,  CD90  and  CD49f,  and  the 
previously  identified  marker  CD44  could  subdivide  BC  into  dis¬ 
tinct  differentiation  states.  Analysis  of  primary  tumors  revealed 
four  predicted  BC  populations:  CD90+CD44+CD49f+  (primitive/ 
basal)  -+  CD90“CD44+CD49f+  -+  CD90“CD44“CD49f+ 
CD90_CD44_CD49f_  (terminal  differentiated)  (Fig.  3D).  Gene 
expression  of  KRT14,  -5,  and  -20  in  each  of  these  purified  sub¬ 
population  was  analyzed  by  q-PCR  (Fig.  3C).  As  expected,  prim¬ 
itive/basal  CD90+CD44+CD49f+  BC  cells  expressed  high  levels  of 
KRT14  and  -5  (Fig.  3C,  red  and  blue)  and  low  levels  of  KRT20 
(Fig.  3C,  green).  KRT14  and  -5  expression  were  decreased  in  the 
CD9CTCD44+CD49f+  intermediate  population  and  had  the  low¬ 
est  expression  in  CD9(TCD44_CD49f+  differentiated  population 
(Fig.  3C).  KRT20  expression  was  highest  in  the  CD9CT 
CD44-CD49f+  differentiated  population  (Fig.  3C). 

Functional  Validation  of  Three  BC  Subtypes.  To  functionally  validate 
these  predicted  BC  differentiation  states,  we  used  our  unique 
surface  marker  profiles  to  isolate  populations  corresponding  to 
each  differentiation  state  from  patient  BCs  using  FACS  (Fig.  44). 
These  isolated  populations  were  then  transplanted  in  vivo  into 


immunodeficient  SCID  mice.  As  noted  above,  only  the  most  up¬ 
stream  population  harbored  T-IC  potential  in  all  BCs  tested.  For 
example,  in  a  representative  BC  that  contained  all  four  differen¬ 
tiation  states  (Fig.  44),  only  the  most  primitive  tumor  cells 
(CD90+CD44+CD49f+)  exhibited  tumorigenicity  in  vivo  (Fig.  4G, 
basal),  regenerating  all  downstream  populations  (Fig.  44)  and 
effectively  reconstituting  all  cellular  compartments  from  the 
original  BC.  Interestingly,  within  this  same  tumor,  transplantation 
of  a  more  downstream  population  (CD9CTCD44+CD49f+)  failed 
to  reestablish  the  tumor  (Fig.  44). 

Examination  of  a  panel  of  patient  BC  specimens  revealed  sig¬ 
nificant  heterogeneity  among  tumors,  some  missing  one  or  more 
differentiation  states  (Fig.  4  C  and  E).  On  the  basis  of  our  analyses, 
BCs  could  be  generalized  into  at  least  three  subtypes:  the  basal 
subtype,  which  contains  all  four  predicted  differentiation  states 
(CD90+CD44+CD49f+,  CD90“CD44+CD49f+,  CD90“CD44“ 
CD49f+,  and  CD9(TCD44_CD49f_)  (Fig.  44);  the  intermediate 
subtype,  which  lacks  the  basal  state  (no  CD90+CD44+CD49f+ 
population)  (Fig.  4C);  and  the  differentiated  subtype,  which  lacks 
both  the  basal  and  intermediate  (no  CD90+CD44+CD49f+  or 
CD90“CD44+CD49f+  populations)  states  (Fig.  4 E).  FACS  iso¬ 
lation  and  subsequent  xenotransplantation  of  sorted  cells  from 
each  differentiation  state  from  specimens  representing  each  BC 
subtype  revealed  that  only  the  most  primitive  upstream  pop¬ 
ulations  formed  tumors  (Fig.  4G)  (e.g.,  in  basal  BC  subtype,  CD90+ 
CD44+CD49f+  cells;  in  intermediate  BC  subtype,  CD90_CD44+ 
CD49f+  cells;  and  in  differentiated  BC  subtype,  CD9CTCD44- 
CD49f+  cells).  Furthermore,  the  T-IC  population  from  each  BC 
subtype  reformed  only  those  downstream  and  not  any  upstream 
populations  (Fig.  4  B ,  D,  and  F).  These  results  revealed  three 
phenotypically  distinct  BC  subtypes,  each  containing  a  distinct  T-IC 
population  that  invariably  represented  the  most  primitive  differ¬ 
entiation  state  from  that  tumor  (Fig.  4G). 

Basal  Subtype  Is  Significantly  Associated  with  Poor  Overall  Survival.  To 

evaluate  the  clinical  significance  of  these  three  unique  BC 
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Fig.  3.  Discovery  of  corresponding  surface  markers  to  ker¬ 
atins  for  differentiation  states  in  BC.  Keratins  are  abbrevi¬ 
ated  as  KX.  Schematics  demonstrating  the  two  criteria  used 
to  set  the  threshold  for  discovering  surface  markers  that 
would  correspond  with  the  following  differentiation  states 
(K14K5+K2(T  red;  K14~K5+K20“  blue;  K14“K5"K20+  green). 
The  discovery  analysis  was  performed  in  the  AffyBC  and  the 
Chungbuk  datasets  (red  horizontal  line  indicates  the  Step- 
Mi  ner-based  threshold).  Boxplots  with  mean  and  confidence 
interval  for  cell  surface  genes  that  fulfill  the  two  separate 
criteria  were  shown  independently.  (A)  The  threshold  was 
set  in  a  way  that  would  discover  surface  markers  that  were 
highly  expressed  in  basal  cells  (K14+K5+K20~,  red)  and 
strongly  down-regulated  during  differentiation.  ( B )  The 
threshold  was  set  in  a  way  that  would  discover  surface 
markers  that  highly  expressed  all  three  differentiation  states 
(red,  blue,  and  green),  and  slightly  down-regulated  during 
differentiation.  The  detailed  method  of  discovery  and 
ranking  of  cell  surface  markers  is  presented  in  Fig.  S3  and 
listed  in  Dataset  SI.  (C)  Messenger  RNA  expression  of  K14, 
K5,  and  K20  in  each  of  the  differentiation  states  defined  by 
corresponding  surface  markers  was  analyzed  by  real-time 
PCR.  Corresponding  surface  marker  combination  that 
defines  each  differentiation  state  was  listed  in  the  x  axis, 
representing  basal,  intermediate,  and  differentiated  states, 
respectively.  The  relative  gene-expression  level  was  in¬ 
dicated  in  the  y  axis.  (D)  Schematic  illustrating  BC  differen¬ 
tiation  states  as  defined  by  keratin  (K)  and  corresponding 
surface  marker  expression  profiles. 
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Fig.  4.  Functional  validation  of  the  computationally  predicted  differentia¬ 
tion  states  in  BC.  In  vivo  validation  of  three  phenotypically  distinct  subtypes 
of  bladder  cancer  (BC)  according  to  their  differentiation  states:  (>A)  basal,  (C) 
intermediate,  and  (E)  differentiated  as  defined  by  surface  marker  profiles 
(CD90,  CD44,  and  CD49f).  BC  cells  were  purified  by  FACS  and  xeno- 
transplanted  intradermally  into  immunodeficient  mice  in  limited  dilution 
(103  and  104).  ( B ,  D,  and  F)  The  immunophenotypes  of  xenograft  tumors 
derived  from  each  BC  subtype  were  reanalyzed  by  FACS  postengraftment. 
(A)  The  basal  BC  subtype  is  composed  of  all  differentiation  states 
[CD90+CD44+CD49f+  (red  box)  CD90“CD44+CD49f+  (blue  box) 
CD90“CD44“CD49f+  (green  box)  CD90“CD44“CD49f“  (light  blue  box)].  (B) 
Only  the  most  upstream  (CD90+CD44+CD49f+)  population  forms  xenograft 
tumors  and  recapitulates  all  downstream  differentiation  states 
(CD90“CD44+CD49f+  -+  CD90“CD44“CD49f+  CD90“CD44“CD49f").  (C)  The 

intermediate  BC  subtype  lacks  the  basal  differentiation  state 

(CD90+CD44+CD49f+).  (D)  Only  the  most  upstream  differentiation  state 
(CD90“CD44+CD49f+)  forms  xenograft  tumors  and  can  reconstitute  all 
downstream  states  (CD90“CD44“CD49f+  CD90“CD44“CD49f“).  (E)  In  dif¬ 

ferentiated  BCs  that  lack  both  the  basal  (CD90+CD44+CD49f+)  and  the  in¬ 
termediate  (CD90“CD44+CD49f+)  differentiation  states,  (E)  only  the  existing 
differentiation  state  (CD90_CD44“CD49f+)  forms  xenograft  tumors  and 
recapitulates  the  terminally  differentiated  (CD90_CD44“CD49f_)  down¬ 
stream  state.  (A,  C,  and  E)  The  terminally  differentiated  subpopulations 
(CD9CTCD44-CD49f-)  never  give  rise  to  tumors.  (G)  Frequency  of  tumor  for¬ 
mation  of  all  transplanted  cell  populations  described  in  A,  C,  and  E. 


subtypes,  we  analyzed  their  prognostic  utility  in  two  independent 
BC  gene-expression  datasets  with  a  total  of  492  patients  [Lindg- 
ren  n  =  89  (9)  and  European  n  =  403  (6),  Fig.  5].  Because  the 
Chungbuk  dataset  is  used  as  a  training  dataset  to  identify  our 
markers,  it  was  excluded  from  further  validation  analysis.  These 
datasets  represent  all  publicly  available  BC  datasets.  Survival  data 
for  two  additional  BC  datasets  were  not  available  for  analysis  (10, 
11).  The  basal  BC  differentiation  subtype,  defined  by  keratin 
(K14+K5+K20“)  or  surface  (CD90+CD44+CD49f+)  marker 
combinations  were  associated  with  worse  overall  survival  com¬ 
pared  with  both  intermediate  (K14_K5+K20_/CD90_CD44+ 
CD49f+)  and  differentiated  (K14“K5“K20+/CD90“CD44“CD49f+) 
subtypes  (Fig.  S4A  and  B).  This  result  was  additionally  validated  in 
two  independent  FFPE  patient  tissue  registries  with  a  total  of  275 
patients  (Stanford,  n  =  158;  Baylor,  n  =  117)  on  the  basis  of  im- 
munohistochemical  analysis  of  KRT14,  KRT5,  and  KRT20  expres¬ 
sion  (Fig.  S5). 

It  is  worth  noting  that  a  subset  of  patient  samples  in  our  analysis 
did  not  fit  easily  into  one  of  our  three  BC  subtypes  (others,  gray;  Fig. 
S4).  This  additional  heterogeneity  may  represent  a  block  in  differ¬ 
entiation,  which  may  occur  at  any  stage  of  differentiation.  We  be¬ 
lieve  that  such  heterogeneity  complements  the  proposed  BC 
differentiation  states  and  reveals  additional  heterogeneity  within  BC 
subtypes  that  should  be  investigated  in  future  analyses  (Fig.  S4 D). 

Basal/Primitive  Cell  Marker  Keratin  14  Is  Significantly  Associated  with 
Poor  Overall  Survival  in  BC.  Because  of  the  significantly  worse 
overall  survival  associated  with  the  basal  BC  subtype,  and  the 
clinical  applicability  of  a  single  immunohistochemistry  (IHC) 
marker,  we  evaluated  the  clinical  significance  of  KRT14  as  a 
single  basal  differentiation  marker. 

Our  analysis  revealed  that  KRT14  gene  expression  was  associ¬ 
ated  with  significantly  worse  overall  survival  in  two  independent 
datasets  (Lindgren,P  =  0.005;  European,  P  <  0.001)  (Fig.  5v4  and 
B).  In  the  European  dataset,  the  prognostic  power  of  KRT14  was 
statistically  significant  in  both  univariate  and  multivariate  analysis 
when  accounting  for  stage,  grade,  age,  and  sex  (multivariate  P  = 
0.0077,  respectively  P  =  0.021,  including  tumors  treated  with 
intravesical  bacillus  Calmette-Guerin/chemo therapy)  (Fig.  5B). 
This  prognostic  power  remained  significant  when  KRT14  gene 
expression  was  analyzed  as  a  continuous  variable  in  both  uni-  and 
multivariate  analysis  in  the  European  dataset  (Table  SI;  multi¬ 
variate  P  =  0.013,  respectively  P  =  0.02,  including  bacillus  Calm- 
ette-Guerin/chemotherapy).  Validation  by  measuring  KRT14 
protein  expression  within  two  independent  FFPE  BC  tissue  cohorts 
revealed  a  significant  association  between  KRT14  and  overall 
survival  (Stanford  P  <  0.0001,  multivariate  P  =  0.0038;  Baylor  P  = 
0.009,  multivariate  P  =  0.032;  Fig.  6  A  and  B).  It  is  important  to 
note  that  different  datasets  use  different  grading  systems.  Whereas 
the  gene  expression  datasets  are  based  on  the  3 -grade  (Lindgren) 
or  4-grade  (European)  system,  the  FFPE  BC  cohorts  (Stanford  and 
Baylor)  are  annotated  with  the  more  recently  adopted  2-grade  (low 
and  high)  system.  Nevertheless,  the  prognostic  power  of  KRT14 
holds  regardless  of  different  grading  systems.  Of  note,  although  the 
prognostic  utility  of  KRT14  is  not  confounded  by  pathological 
grade,  high  grade  tumors  are  significantly  enriched  for  KRT14 
expression  and  vice  versa  (IHC  datasets,  Pearson’s  test:  Stan¬ 
ford,  P  =  0.01;  Baylor,  P  =  0.006).  Finally,  subgroup  analysis  of 
clinically  important  BC  groups,  including  muscle  invasive  disease 
(>pT2),  low  stage  disease  (pTa),  and  patients  treated  with  radical 
cystectomy,  could  be  consistently  stratified  by  KRT14  expression  in 
all  datasets  tested  (Figs.  S6  and  S7). 

Discussion 

Identification  and  characterization  of  differentiation  steps  are 
critical  to  our  understanding  of  both  normal  tissue  development 
and  malignant  transformation.  During  normal  urothelial  differ¬ 
entiation,  it  is  generally  accepted  that  basal,  intermediate,  and 


Volkmer  et  al. 


PNAS  |  February  7,  2012  |  vol.  109  |  no.  6  |  2081 


COMPUTER  SCIENCES  MEDICAL  SCIENCES 


Lindgren  dataset 

(7/34) 


(16/51) 


p=0.005 


(3/4) 


Groups 

K14  low 

K14  int. 

K14  high 

Total 

34 

51 

4 

Tx 

1 

1 

0 

Ta 

5 

6 

0 

T1 

13 

18 

0 

T2 

7 

11 

2 

T3 

7 

13 

2 

T4 

1 

2 

0 

G2 

9 

10 

0 

G3 

25 

41 

4 

No.  at  risk 
K1 4  low  34 

K14  int.  51 

K14  high  4 


28  14 

38  23 

1  1 


B 


European  dataset 

(75/257) 


pcO.OOl  (7/10) 


20 


40 


..  ,  .  .  Months 

No.  at  risk 

K14  low  257  209  174 

K14  int.  136  108  89 

K14  high  10  3  3 


Multivariate  analysis  including  intravesical 
BCG  &  Mitomycin  therapy 


Groups 

K14  low 

K14  int. 

K14  high 

Total 

257 

136 

10 

Ta 

126 

62 

1 

T1 

106 

53 

3 

Tis 

1 

1 

0 

T2 

22 

18 

4 

T3 

2 

1 

1 

T4 

0 

1 

1 

G1-2 

67 

31 

0 

G3-4 

168 

93 

9 

M 

205 

112 

7 

F 

50 

21 

3 

Aqe  20  -  29  Y 

0 

2 

0 

Aqe  30  -  39  Y 

4 

1 

0 

Aqe  40  -  49  Y 

14 

4 

0 

Aqe  50  -  59  Y 

45 

22 

1 

Aqe  60  -  69  Y 

62 

35 

2 

Aqe  70  -  79  Y 

90 

49 

5 

Aqe  80  -  89  Y 

33 

12 

2 

Aqe  90+  Y 

2 

2 

0 

HR(lo-hi) 

P 

c 

K14 

1.99(1.11-3.58) 

0.021 

* 

Stage 

1.05(0.86-1.27) 

0.65 

Grade 

2.26(1.14-4.47) 

0.02 

* 

Age 

1.33(1.00-1.78) 

0.053 

Sex 

0.81(0.51-1.26) 

0.34 

Intrav.  Mitomycin 

0.72(0.16-3.27) 

0.67 

Intrav.  BCG 

2.77(1.25-6.12) 

0.012 

* 

Multivariate  analysis 


Stage 


Age 


HR(lo-hi) 


1.51(1.11-2.03) 


1.32(1.20-1.45) 


1.32(1.00-1.74) 


1.45(1.23-1.71) 


1.02(0.82-1.27) 


(D  O 

O 


Stanford  dataset 


B 


(22/105) 


(26/53) 


pcO.OOOl 


Baylor  dataset 


(28/89) 


(16/28) 


p=0.009 


No.  at  risk 

0 

20  40 

Months 

60 

No.  at  risk 

0 

20  40 

Months 

60 

K14  neg. 

105 

84 

61 

55 

K14  neg. 

89 

75 

65 

42 

K14  pos. 

53 

31 

23 

19 

K14  pos. 

28 

18 

13 

9 

Multivariate  analysis 


Multivariate  analysis 


HR(lo-hi) 

P 

c 

K14 

2.42(1.33-4.40) 

0.0038 

** 

Stage 

24.05(2.79-206.93) 

0.0038 

** 

Grade 

1.14(0.24-5.48) 

0.87 

Age 

1.03(1.00-1.06) 

0.042 

* 

Sex 

0.87(0.63-1.21) 

0.41 

HR(lo-hi) 

P 

c 

K14 

2.26(1.07-4.74) 

0.032 

* 

Stage 

0.71(0.44-1.15) 

0.16 

Grade 

1 .32(0.52-3.34) 

0.55 

Age 

1.03(1.00-1.06) 

0.069 

Sex 

0.93(0.63-1.38) 

0.71 

K14- 

K14  + 

Total 

Total 

105 

53 

158 

Non-invasive 

39 

11 

49 

Invasive 

65 

42 

109 

LG 

31 

6 

37 

HG 

73 

47 

120 

M 

82 

36 

118 

F 

22 

17 

39 

Avg  Age 

69.5 

73.8 

K14- 

K14  + 

Total 

Total 

89 

28 

117 

Ta 

38 

5 

43 

Tis 

1 

0 

1 

T1 

18 

2 

20 

T2 

13 

3 

16 

T3 

16 

12 

28 

T4 

3 

6 

9 

LG 

34 

3 

37 

HG 

55 

25 

80 

M 

69 

23 

92 

F 

20 

5 

25 

Avg  Age 

67.3 

70.8 

TUR-BT 

58 

8 

66 

Cvstectomv 

31 

20 

51 

Fig.  5.  Keratin  14  gene  expression  is  associated  with  worse  patient  survival 
in  BC.  Kaplan-Meier  analysis  of  the  probability  of  cancer-specific  (A)  and 
overall  ( B )  survival  according  to  differentiation  states  in  bladder  cancer  as 
defined  by  Keratin  14  (K14)  gene-expression  level  in  two  independent 
datasets,  Lindgren  04)  and  European  ( B ). 


umbrella  cells  represent  sequential  differentiation,  from  primitive 
to  mature.  It  is  likely  that  malignant  transformation  can  occur  in 
any  of  these  cell  types  to  form  tumors  with  distinct  T-IC  pop¬ 
ulations  (5).  Our  results  indicate  a  multistep  differentiation  hier¬ 
archy  in  BCs  that  parallels  normal  urothelial  differentiation.  The 
resulting  unique  classification  scheme  broadly  divides  BC  into 
three  differentiation  subtypes — basal,  intermediate,  and  differen¬ 
tiated.  We  further  demonstrated  that  each  BC  subtype  possesses 
its  own  phenotypically  distinct  T-IC  population  within  its  most 
primitive  compartment.  Such  a  T-IC  population  exists  at  the  top  of 
a  hierarchical  relationship  and  is  capable  of  reconstituting  all 
downstream  populations.  These  results  add  complexity  to  our 
originally  proposed  T-IC  model  (19)  and  suggest  BC  conforms  to 
the  cancer  stem  cell  model  (19,  28-33). 

A  subset  of  patient  samples  in  our  analysis  does  not  fit  into 
the  three  BC  subtypes,  which  may  reflect  additional  diversity. 
However,  we  did  not  find  evidence  of  cellular  plasticity  as  re¬ 
cently  described  by  Chaffer  et  al.  (34).  In  our  functional  in  vivo 
studies,  BC  cells  give  rise  to  downstream  differentiation  states 
but  are  incapable  of  reforming  upstream  populations.  More 
stringent  biological  assays  such  as  lineage  tracing  in  mice  can  be 
explored  in  future  to  provide  definitive  evidence  supporting  our 
hierarchy  model. 

Stratification  of  patients  by  BC  subtypes,  using  keratin  and  cell 
surface  markers,  showed  significant  prognostic  utility.  Moreover, 
KRT14  expression  is  strongly  associated  with  poor  survival,  in¬ 
dependent  of  established  clinical  and  pathological  variables  in¬ 
cluding  stage,  grade,  age,  and  sex.  For  example,  KRT14  identifies 
patients  with  worse  outcome  in  both  nonmuscle  invasive  (pTa)  and 


Fig.  6.  Keratin  14  protein  expression  is  associated  with  worse  patient  sur¬ 
vival  in  BC.  ( A  and  B )  Kaplan-Meier  analysis  of  the  probability  of  overall 
survival  according  to  differentiation  states  in  bladder  cancer  as  defined  by 
keratin  14  (K14)  in  two  independent  tissue  datasets,  Stanford  (A)  and  Baylor 
(B).  (C)  Representative  micrographs  of  K14  IHC  staining,  scoring  (0-3),  and 
stratification  (negative,  0-1;  positive,  2-3)  are  presented. 


muscle  invasive  (pT2  and  >pT2)  tumors.  Within  the  muscle  invasive 
cohort,  identification  of  high-risk  patients  may  allow  for  effective 
early  utilization  of  aggressive  therapies  like  neoadjuvant  chemo¬ 
therapy  and  provide  another  means  to  stratify  patients  in  clinical 
trials.  These  considerations  provide  strong  rationale  for  prospective 
studies  evaluating  KRT14  expression  as  a  risk-stratifying  marker. 

The  prognostic  utility  of  KRT14  held  when  tumors  were  ana¬ 
lyzed  by  both  gene  expression  and  IHC,  the  latter  being  a  tech¬ 
nique  easily  added  to  the  repertoire  of  clinical  laboratories. 
However,  our  IHC  analysis  identified  relatively  more  KRT14 
positive  patients  than  gene-expression  analysis.  There  are  two 
possible  explanations:  differences  in  patient  cohorts  and  assay 
sensitivity.  The  IHC  data  were  obtained  from  patients  treated  at 
Stanford  University  and  Baylor  College  of  Medicine,  which  are 
tertiary  referral  centers  that  commonly  treat  advanced-stage  BCs 
(59%  with  invasive  disease),  whereas  gene-expression  data  were 
obtained  from  patients  treated  at  multiple  different  European 
centers  (ranging  from  primary  to  tertiary  centers)  and  therefore 
had  overall  less  advanced  BCs  (19%  with  invasive  disease).  Ad¬ 
ditionally,  gene-expression  analysis  averages  mRNA  expression 
throughout  an  entire  sample,  whereas  IHC  provides  resolution  up 
to  a  single  cell.  Therefore,  the  same  patient  who  may  appear 
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KRT14  negative  in  a  gene-expression  analysis  may  be  identified 
through  IHC  as  KRT14  positive.  However,  the  fact  that  both  gene- 
expression  and  IHC  analyses  indicated  KRT14  as  an  independent 
prognostic  marker  speaks  to  the  robustness  of  this  early  progenitor 
cell  marker  in  BC  prognosis. 

In  addition  to  the  differences  between  gene  expression  and  IHC 
analysis,  the  nature  of  a  retrospective  study  has  its  own  limitations. 
For  example,  important  clinical  parameters  such  as  lymph  node 
status,  detailed  cytopathological  features,  and  full  treatment  his¬ 
tory  are  not  always  available.  Additionally,  the  distribution  of 
clinicopathological  features  in  the  study  cohorts  may  not  reflect 
the  natural  patient  distribution.  For  example,  carcinoma  in  situ 
cases  are  relatively  underrepresented  in  all  of  the  datasets  used  in 
this  study.  To  overcome  these  limitations,  the  clinical  utility  of 
KRT14  needs  to  be  validated  in  future  prospective  trials. 

In  summary,  we  have  developed  a  unique  computational 
strategy  to  identify  prognostic  markers  linked  to  cellular  differ¬ 
entiation.  We  subsequently  validated  a  set  of  distinct  differen¬ 
tiation  markers  in  BC  through  in  vivo  assays  and  clinical 
outcomes  analyses.  It  is  likely  that  this  method  can  be  readily 
generalizable  to  other  cancers.  Our  results  hold  immediate 
implications  to  understanding  BC  biology  and  further  de¬ 
velopment  of  unique  targeted  therapies.  Finally,  our  analysis 
revealed  a  clinically  applicable  marker,  KRT14,  which  we  believe 
is  an  ideal  candidate  for  a  large  prospective  trial  to  assess  risk- 
adapted  therapies. 
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Data  Collection,  Processing,  and  Statistical  Analysis.  See  SI  Methods  for 

further  details. 
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Correction  for  “Three  differentiation  states  risk-stratify  bladder 
cancer  into  distinct  subtypes,”  by  Jens-Peter  Volkmer,  Debashis 
Sahoo,  Robert  K.  Chin,  Philip  Levy  Ho,  Chad  Tang,  Antonina 
V.  Kurtova,  Stephen  B.  Willingham,  Senthil  K.  Pazhanisamy, 
Humberto  Contreras-Trujillo,  Theresa  A.  Storm,  Yair  Lotan, 
Andrew  H.  Beck,  Benjamin  I.  Chung,  Ash  A.  Alizadeh,  Guilherme 
Godoy,  Seth  P.  Lerner,  Matt  van  de  Rijn,  Linda  D.  Shortliffe, 
Irving  L.  Weissman,  and  Keith  S.  Chan,  which  appeared  in  issue 
6,  February  7,  2012,  of  Proc  Natl  Acad  Sci  USA  (109:2078-2083; 
first  published  January  19,  2012;  10.1073/pnas.ll20605109). 

The  authors  note  that  Robert  K.  Chin  should  be  listed  as  an 
additional  corresponding  author.  The  corrected  correspondence 
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Single-cell  dissection  of  transcriptional  heterogeneity 
in  human  colon  tumors 


Piero  Dalerba1,2,9,  Tomer  Kalisky3,9,  Debashis  Sahoo1,9,  Pradeep  S  Rajendran1,  Michael  E  Rothenberg1,4, 
Anne  A  Leyrat3,  Sopheak  Sim1,  Jennifer  Okamoto3,5,  Darius  M  Johnston1,3,5,  Dalong  Qian1,  Maider  Zabala1, 
Janet  Bueno6,  Norma  F  Neff3,  Jianbin  Wang3,  Andrew  A  Shelton7,  Brendan  Visser7,  Shigeo  Hisamori1, 

Yohei  Shimono1,  Marc  van  de  Wetering8,  Hans  Clevers8,  Michael  F  Clarke1,2,9  &  Stephen  R  Quake3,5,9 


Cancer  is  often  viewed  as  a  caricature  of  normal  developmental  processes,  but  the  extent  to  which  its  cellular  heterogeneity  truly 
recapitulates  multilineage  differentiation  processes  of  normal  tissues  remains  unknown.  Here  we  implement  single-cell  PCR 
gene-expression  analysis  to  dissect  the  cellular  composition  of  primary  human  normal  colon  and  colon  cancer  epithelia.  We  show 
that  human  colon  cancer  tissues  contain  distinct  cell  populations  whose  transcriptional  identities  mirror  those  of  the  different 
cellular  lineages  of  normal  colon.  By  creating  monoclonal  tumor  xenografts  from  injection  of  a  single  (n  =  1)  cell,  we  demonstrate 
that  the  transcriptional  diversity  of  cancer  tissues  is  largely  explained  by  in  vivo  multilineage  differentiation  and  not  only  by 
clonal  genetic  heterogeneity.  Finally,  we  show  that  the  different  gene-expression  programs  linked  to  multilineage  differentiation 
are  strongly  associated  with  patient  survival.  We  develop  two-gene  classifier  systems  (KRT20  versus  CA1,  MS4A12 ,  CD177 , 
SLC26A3)  that  predict  clinical  outcomes  with  hazard  ratios  superior  to  those  of  pathological  grade  and  comparable  to  those  of 
microarray-derived  multigene  expression  signatures. 


The  in  vivo  cellular  composition  of  solid  tissues  is  often  difficult  to 
investigate  in  a  comprehensive  and  quantitative  way.  Techniques 
such  as  immunohistochemistry  and  flow  cytometry  are  limited  by 
the  availability  of  antigen- specific  monoclonal  antibodies  and  by  the 
small  number  of  parallel  measurements  that  can  be  performed  on 
each  individual  cell.  Traditional  high-throughput  assays,  such  as  gene- 
expression  arrays,  when  performed  on  whole  tissues,  provide  infor¬ 
mation  on  average  gene  expression  levels,  and  can  be  correlated  only 
indirectly  to  quantitative  modifications  in  cellular  subpopulations. 
These  limitations  become  particularly  difficult  to  overcome  when 
studying  minority  populations,  such  as  stem  cells,  whose  iden¬ 
tification  is  made  elusive  by  their  low  numbers  and  by  the  lack  of 
exclusive  markers.  Moreover,  in  pathological  states,  such  as  cancer, 
it  is  usually  impossible  to  determine  whether  perturbations  in  gene 
expression  detected  in  whole  tissues  are  due  to  modifications  in 
the  relative  composition  of  different  cell  types  or  to  aberrations  in 
the  gene- expression  profile  of  mutated  cells. 

For  example,  although  it  has  been  postulated  that  multilineage 
differentiation  can  contribute  to  tumor  heterogeneity1-3,  this  issue 
remains  controversial4.  Many  in  the  field  view  cancer  heterogeneity 
mainly  as  the  result  of  clonal  evolution  secondary  to  genomic  instabi¬ 
lity5,6.  Previous  studies  addressed  this  question,  but  could  rely  only  on 
in  vitro  cultured  cell  lines  and  on  simple  morphological  evidence7-9. 


Moreover,  recent  evidence  indicates  that,  in  the  absence  of  a  molecular 
proof  of  monoclonal  origin,  results  from  in  vitro  experiments  based 
on  limiting  dilution  can  be  biased  due  to  a  dramatic  increase  in  cell 
survival  by  cell  hetero- doublets.  This  phenomenon  is  best  exemplified 
in  the  case  of  the  mouse  small  intestine,  where  growth  and  expansion 
of  LGR5+  progenitor  cells  is  dramatically  enhanced  by  the  presence  of 
bystander  epithelial  feeder  cells10.  Based  on  these  studies,  it  remained 
difficult  to  perform  a  quantitative  measure  of  the  degree  of  multiline¬ 
age  differentiation  in  cancer  tissues  and,  above  all,  to  investigate  to 
what  degree  it  actually  translated  into  the  differential  activation  of  dis¬ 
tinct  transcriptional  programs  that  would  mirror  and  recapitulate  the 
physiological  processes  observed  in  normal  tissues.  In  this  study  we 
developed  a  method  to  dissect  and  investigate  at  the  single-cell  level 
the  gene-expression  profile  of  the  distinct  cell  populations  contained 
in  primary  human  colon  epithelia,  both  normal  and  neoplastic. 

RESULTS 

Description  and  technical  validation  of  single-cell  PCR 

We  combined  fluorescence  activated  cell  sorting  (FACS)  and  single¬ 
cell  PCR  gene-expression  analysis  to  perform  a  high-throughput 
transcriptional  analysis  of  the  distinct  cellular  populations 
contained  in  solid  human  tissues  (Supplementary  Figs.  1  and  2). 
This  method  exploits  the  capacity  of  modern  flow  cytometers  to  sort 
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individual  single  cells  with  accuracy  and  precision  (Supplementary 
Fig.  3),  together  with  the  use  of  microfluidic  technologies  to  perform 
high- sensitivity  multiplexed  PCR  from  minute  amounts  of  mRNA, 
thereby  allowing  parallel  analysis  of  the  expression  of  up  to  96  genes 
for  each  individual  cell.  The  large  number  of  measurements  per  cell 
and  the  possibility  to  analyze  several  hundred  cells  in  parallel  from 
the  same  sample  allow  the  use  of  statistical  clustering  algorithms  to 
associate  cells  with  similar  gene  expression  profiles  into  well-defined 
subpopulations  (Supplementary  Fig.  2).  Microfluidic  platforms  have 
been  previously  validated  for  single-cell  gene- expression  analysis11-13. 
Consistent  with  those  results,  our  control  experiments  with  titrated 
mRNA  standards  as  well  as  single-cell  experiments  on  a  cell  line  vali¬ 
dated  the  sensitivity  of  this  approach  for  high-throughput  analysis 
across  multiple  genes  (Supplementary  Fig.  4). 

Analysis  of  normal  human  colon  epithelium 

We  first  applied  single-cell  PCR  to  the  study  of  normal  human  colon 
epithelial  cells.  Human  colon  epithelium  is  composed  of  heterogene¬ 
ous  populations  of  cells  that  express  different  protein  markers  based 
on  their  lineage,  differentiation  stage  and  functional  status.  Many  of 
these  cell  subsets  can  be  identified  by  immunohistochemistry  against 
well- characterized  markers,  such  as  MUC2,  expressed  by  goblet  cells; 
MKI67,  expressed  by  proliferating  cells;  KRT20  and  CEACAM1  (also 
known  as  CD66a),  preferentially  expressed  by  cells  at  the  top  of  the 
colonic  crypt  (Fig.  la-d)14. 

Under  normal  conditions,  immature  colon  epithelial  cells  reside 
at  the  bottom  of  the  colonic  crypts  (bottom-of-the-crypt  cells)  and 
express  high  levels  of  the  surface  marker  CD44,  whereas  differentiated 
mature  cells  progressively  migrate  to  the  top  (top-of-the-crypt  cells) 
and  progressively  lose  CD44  expression14,15.  We  focused  our  analysis 
on  the  stem  and  progenitor  cell  compartments  of  the  colonic  epi¬ 
thelium  by  sorting  the  EpCAMhlgh/CD44+  population  (Fig.  le,f;  P12) 
which,  in  normal  tissues,  corresponds  to  the  bottom  of  the  human 
colonic  crypt14.  To  study  the  more  mature,  terminally  differentiated 
cell  populations,  we  sorted  and  analyzed  an  equal  number  of  cells 
from  the  EpCAM+/CD44-/CD66ahlgh  population,  which  corresponds 
to  the  top  of  the  human  colonic  crypt  (Fig.  le,f;  Pll)16. 

We  first  tested  the  ability  of  single-cell  PCR  gene-expression  ana¬ 
lysis  to  distinguish  different  cell  populations  using  well-established 
Sjjk  reference  markers.  We  analyzed  and  clustered  colon  epithelial  cells 
\s9  using  three  genes  encoding  markers  linked  to  either  one  of  the  two 
major  cell  lineages  (that  is,  MUC2  for  goblet  cells  and  CA1  for  ente- 
rocytes)  or  the  immature  compartment  (that  is,  LGR5)  of  the  colon 
epithelium14,17-19.  This  experiment  showed  that  genes  encoding 
lineage-specific  markers  are  frequently  expressed  in  a  mutually  exclu¬ 
sive  way,  mirroring  the  expression  pattern  of  corresponding  proteins 
(Supplementary  Fig.  5). 

We  then  searched  for  gene- expression  markers  of  the  different  cell 
populations,  with  a  special  focus  on  putative  stem  cell  markers.  We 
mined  1,568  publicly  available  gene- expression  array  data  sets  from 
human  colon  epithelia  (Supplementary  Table  1),  using  a  bioinformatics 
approach  designed  to  identify  developmentally  regulated  genes  based  on 
Boolean  implication  logic  (Supplementary  Fig.  6)20.  The  search  yielded 
candidate  genes  whose  expression  was  associated  with  that  of  other 
markers  previously  linked  to  individual  colon  epithelial  cell  lineages 
(Supplementary  Figs.  7-9).  Using  an  iterative  approach,  we  screened 
>230  genes  on  eight  independent  samples  of  normal  human  colon  epithe¬ 
lium  by  single- cell  PCR  gene- expression  analysis.  At  each  round,  genes 
that  were  noninformative  (that  is,  not  differentially  expressed  in  either 
positive  or  negative  association  with  CARMUC2  or  LGR5)  were  removed 
and  replaced  with  new  candidate  genes.  Thereby,  we  progressively 


built  a  list  of  57  TaqMan  assays  that  allowed  us  to  analyze  the  expression 
pattern  of  53  distinct  genes  (3  housekeeping,  3  proliferation-related  and 
47  differentially  expressed  genes;  Supplementary  Table  2)  with  high 
robustness  (Supplementary  Fig.  10).  This  allowed  us  to  characterize 
multiple  cell  populations,  using  both  hierarchical  clustering  (Fig.  lg) 
and  principal  component  analysis  (PCA;  Fig.  lh,i). 

Analysis  of  the  EpCAM+/CD44-/CD66ahlgh  population  (enriched 
for  top-of-the-crypt  cells)  revealed  that  this  subset,  although  tran¬ 
scriptionally  heterogeneous,  was  almost  exclusively  composed  of  cells 
expressing  high  levels  of  genes  characteristic  of  mature  enterocytes 
(e.g.,  CAi+,  CA2+ ,  KRT20+ ,  SLC26A3+,AQP8+  and MS4A12+)U- 21-23 
and  led  to  the  discovery  of  at  least  two  gene  expression  markers 
whose  differential  expression  pattern — to  our  knowledge — has  not 
been  reported  before  ( CD177  and  GUCA2B )  (Fig.  lg).  To  validate 
the  reliability  of  single-cell  PCR  gene-expression  analysis  results,  we 
evaluated  the  distribution  of  SLC26A3  and  CD  177  protein  expression 
in  tissue  sections  and  we  confirmed  its  preferential  expression  at  the 
top  of  the  human  colonic  crypts  (Supplementary  Figs.  11  and  12). 

We  could  also  distinguish  different  subsets  of  cells  with  different 
transcriptional  profiles  within  the  EpCAM+/CD44-/CD66ahlgh  popu¬ 
lation  (e.g.,  CA1+/SLC26A3+  versus  GUCA2B+).  At  the  present  time, 
it  is  not  clear  whether  they  represent  distinct  stages  of  differentiation 
or  distinct  functional  subsets  of  colonic  enterocytes.  Nonetheless, 
their  clearly  unique  transcriptional  programs  identify  them  as  part 
of  a  distinct  cellular  population. 

Analysis  of  the  EpCAMhlgh/CD44+  population  (enriched  for 
‘bottom-of-the-crypt’  cells)  revealed  the  presence  of  multiple 
populations,  including:  (i)  a  cell  compartment  characterized  by 
the  expression  of  genes  linked  to  goblet  cells  ( MUC2+ ,  TRF3hlgh, 
SPDEF+ ,  SPINK4+)24,25,  (ii)  a  cell  compartment  characterized  by 
the  co-expression  of  genes  associated  with  immature  cells  as  well  as 
genes  known  to  be  expressed  by  enterocytes  ( OLFM4+ ,  CA2hlgh)  and 
(iii)  a  cell  compartment  whose  gene-expression  profile  mirrors  that 
of  a  stem  and/or  progenitor  cell  compartment  in  the  mouse  small 
intestine  (LGR5+,  ASCL2+ ,  PTPRO+,  RGMB+)17,26.  A  synopsis  of 
the  key  genes  that  define  the  gene- expression  profile  of  the  different 
populations  is  provided  in  Supplementary  Table  3. 

The  OLMF4+/CA2hiZh  and  the  LGR5+/ASCL2+  compartments 
shared  expression  of  several  genes  of  functional  interest  in  both  stem 
cell  and  cancer  biology,  such  as  genes  involved  in  self-renewal  and 
chromatin  remodeling  ( EZH2 ,  BMJi)27-29,  Wnt-pathway  signaling 
(. AXIN2 )30,  cell  growth  and  chemotaxis  (CXCL2)31,  stem  cell  qui¬ 
escence  ( LRIG1 )32  and  oncogenes  (M7C)33.  The  expression  of  pro¬ 
liferation  markers,  such  as,  MKI67 ,  TOP2A ,  BIRC5  (also  known  as 
Survivin)  appeared  to  be  restricted  to  the  EpCAMhlgh/CD44+  (bottom- 
of-the-crypt)  population  and  particularly  to  the  LGR5+/ASCL2+  and 
MUC2+/TFF3hi8h  cells.  This  was  partially  expected  based  on  both 
previously  published  data14,17,19  and  our  own  immunohistochemistry 
results  (Supplementary  Fig.  13c). 

We  also  observed  that  MUC2+/TFF3hl%h  cells  were  characterized  by 
high  expression  levels  of  several  genes  of  interest,  including  DLL1  and 
DLL4 ,  encoding  for  two  Notch  ligands,  and  KRT20.  The  expression 
of  KRT20  at  the  bottom  of  the  crypt  appears  contrary  to  the  notion  of 
KRT20  as  a  terminal  differentiation  marker.  However,  a  more  careful 
examination  of  immunohistochemical  stainings  identified  scattered 
KRT20+  cells,  which  can  be  morphologically  identified  as  goblet  cells 
(Supplementary  Fig.  13a,b).  We  also  noticed  that  MUC2+/TFF3hi8h 
cells,  for  the  most  part,  did  not  express  CFTR ,  the  gene  mutated  in 
cystic  fibrosis.  The  differential  expression  of  DLL4  is  of  potential  rele¬ 
vance  to  the  clinical  development  of  novel  anti-tumor  therapeutic 
agents  directed  against  this  molecule34. 
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Figure  1  Single-cell  PCR  gene-expression  analysis  of  human  normal  colon  epithelium,  (a-d)  Immunohistochemistry  of  normal  human  colon  epithelium, 
stained  for  MUC2  (a),  labeling  goblet  cells,  MKI67  (b),  labeling  proliferating  cells,  KRT20  (c)  and  CEACAM1  (d),  preferentially  labeling  top-of-the-crypt 
cells.  (e,f)  Flow  cytometry  sorting  strategy  for  top-of-the-crypt  and  bottom-of-the-crypt  epithelial  cells,  (e)  Colon  epithelial  cells,  both  CD44ne§  and 
CD44+,  were  separated  from  stromal  cells  based  on  their  EpCAM+  phenotype,  (f)  Bottom-of-the-crypt  epithelial  cells  were  defined  as  EpCAMhl§h/CD44+ 
(f,  P12  blue  sort  gate)  and  top-of-the-crypt  epithelial  cells  as  EpCAM+/CD44_/CD66ahi§h  (f,  Pll  orange  sort  gate),  (g)  Hierarchical  clustering  of 
single-cell  PCR  gene-expression  analysis  data  visualized  distinct  cell  populations,  including  enterocyte-like  cells  ( CA1+/SLC26A3+  and  GUCA2B+), 
goblet-like  cells  ( MUC2+/TFF3high )  and  two  compartments  defined  by  gene-expression  profiles  reminiscent  of  more  immature  progenitors  ( OLFM4+ / 
CA2high  and  LGR5+/ASCL2+ ).  (h,i)  Principal  component  analysis  of  single-cell  PCR  gene-expression  data  visualized  different  cell  types  and  different 
gene  families.  Different  cell  types  were  characterized  by  different  scores  along  the  two  main  principal  components  (PCI  and  PC2)  (h).  Different 
gene  families  were  characterized  by  different  contributions  to  the  two  main  principal  components.  To  allow  comparisons  between  hierarchical  clustering 
and  PCA  results,  we  displayed  each  cell  or  gene  in  PCA  plots  with  the  color  corresponding  to  the  cell  type  or  gene  family  it  was  assigned  to  based  on 
hierarchical  clustering  (i). 
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Analysis  of  a  primary  human  colon  adenoma 

We  then  turned  to  cancer  and  investigated  whether  the  cellular 
composition  of  the  normal  colonic  epithelium  is  preserved 
in  colorectal  tumors,  both  benign  and  malignant.  Analysis  by 
single-cell  PCR  gene-expression  analysis  of  EpCAMhl§h/CD44+ 
cells  from  a  primary  tubulo-villous  adenoma  (sample  name: 
SU-COLON#76;  Supplementary  Table  4)  revealed  the  presence 


of  at  least  two  different  cell  populations  (that  is,  LGR5+/ASCL2+ 
and  MUC2+/TFF3hl£h)  characterized  by  distinctive  gene  signatures, 
closely  mirroring  the  subpopulations  observed  in  corresponding 
EpCAMhlSh/CD44+  populations  of  normal  tissues  (Fig.  2a-c). 

These  observations  were  confirmed  at  the  protein  level  by  par¬ 
allel  immunohistochemical  investigations  for  KRT20  and  MUC2 
(Fig.  2d,e)  and  are  in  agreement  with  the  recent  finding  that 


SU-COLON#76 

(primary,  tubulo-villous  adenoma) 


KRT20  MUC2 


Figure  2  Single-cell  PCR  gene-expression  analysis  of  human  colon  tumor  tissues,  (a)  Hierarchical  clustering  of  single-cell  PCR  gene-expression  data  from  the 
EpCAM+/CD44+  population  of  a  large  primary  benign  adenoma  (sample:  SU-COLON#76;  see  Supplementary  Table  4).  The  analysis  revealed  the  presence  of 
multiple  cell  populations  characterized  by  distinct  gene  signatures,  closely  mirroring  lineages  and  differentiation  stages  observed  in  the  EpCAM+/CD44+  population 
from  the  normal  colon  epithelium.  (b,c)  Principal  component  analysis  (PCA)  of  single-cell  PCR  gene-expression  analysis  data  confirmed  hierarchical  clustering 
results,  visualizing  cell  types  (b)  and  gene  families  (c)  similar  to  those  identified  in  normal  tissues.  (d,e)  Gene-expression  data  were  confirmed  at  the  protein  level 
by  immu nohistochemistry,  testing  for  expression  of  KRT20  (d)  and  MUC2  (e)  on  corresponding  tissue  sections,  (f-j)  A  similar  study  on  a  monoclonal  colon  cancer 
xenograft  obtained  from  injection  of  a  single  (/?=  1)  cell  in  a  NOD/SCI  D/I  L2R^/_  mouse  (UM -CO LON #4  clone  8)  produced  similar  results  in  terms  of  hierarchical 
clustering  (f),  cell  types  identified  by  PCA  (g),  gene  families  identified  by  PCA  (h),  immunohistochemistry  results  for  KRT20  (i)  and  immunohistochemistry  results 
for  MUC2  (j).  Results  from  the  monoclonal  tumor  xenograft  indicated  that  the  distinct  cell  populations  visualized  by  single-cell  PCR  did  not  arise  as  the  result  of 
the  coexistence  within  the  tumor  tissue  of  independent  genetic  subclones,  but  as  the  result  of  multilineage  differentiation  processes  during  tumor  growth.  Color 
coding  of  normalized  threshold  cycle  (Ct)  values  in  hierarchical  clustering  plots  and  of  gene  families  in  PC  loading  plots  are  identical  to  those  of  Figure  1. 
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Figure  3  Analysis  of  a  monoclonal  human 
colon  cancer  xenograft  obtained  from  injection 
of  a  single  (/?  =  1)  cell  in  NOD/SCI  D/I  L2R^/_ 
mice,  (a)  In  human  colon  cancer,  the 
frequency  of  EpCAMhlgh/CD44+  cells  capable 
to  establish  a  tumor  upon  xenotransplantation 
in  NOD/SCI  D/I  L2Ry_/_  mice  varies  based  on 
the  xenograft  line,  as  shown  by  comparative 
limiting-dilution  experiments,  (b)  Single  (/?  =1) 
lentivirus-infected  EGFP+/EpCAMhigh/CD44+ 
cancer  cells  can  be  sorted  by  flow  cytometry 
for  injection  in  mice.  (c,d)  Analysis  by  flow 
cytometry  of  a  monoclonal  tumor  derived  from 
injection  of  a  single  (n  =  1),  lentivirus-tagged, 
EGFP+/EpCAMhigh/CD44+  cancer  cell  from  the 
human  colon  cancer  xenograft  UM-C0L0N#4 
(clone  8)  confirmed  that  human  cells  expressed 
EGFP  (c)  and  contained  both  EpCAMlow/CD44- 
and  EpCAMhlgh/CD44+  populations  (d).  (e)  The 
monoclonal  origin  of  the  UM -CO LON #4  clone 
8  tumor  was  confirmed  by  LM-PCR,  showing 
the  presence  of  a  unique  lentivirus  integration 
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site  in  both  EGFP+/EpCAMlow/CD44-  and 

EGFP+/EpCAMhlgh/CD44+  populations,  contrary  to  what  was  observed  in  its  polyclonal  parent  tumor.  A  larger  image  of  the  LM-PCR  gel  is  provided  in 
Supplementary  Figure  24.  (f,g)  Immunohistochemistry  of  monoclonal  tumor  tissues  revealed  heterogeneous  and  mutually  exclusive  expression  patterns 
of  KRT20  (f)  and  MKI67  (g).  (h)  Similar  to  what  is  observed  in  parent  tumors,  EpCAMhlgh/CD44+  and  EpCAMlow/CD44_  populations  from  UM-C0L0N#4 
clone  8  were  characterized  by  different  tumorigenic  capacity,  as  evaluated  by  tumorigenicity  experiments  in  NOD/SCI D/ 1 L2R^r/  mice. 


KRT20  is  frequently  expressed  in  a  mutually  exclusive  pattern 
with  respect  to  LGR5  (ref.  19).  This  primary  adenoma  appeared 
depleted  in  CAl+/SLC26A3+y  GUCA2B+ and  OLFM4+/CA2hi8h  cell 
populations.  A  careful  examination  of  public  gene-expression  array 
databases  indicated  that  this  unexpected  feature  is  likely  common 
to  many  benign  adenomas  (Supplementary  Fig.  14). 

Analysis  of  a  human  colon  cancer  xenograft  derived  from  a 
single  cancer  cell 

Tumor  tissues,  both  benign  and  malignant,  are  known  to  undergo 
perturbations  of  normal  differentiation  processes,  but  it  is  unclear  to 
what  extent  those  perturbations  reflect  quantitative  changes  in  cell 
composition  or  qualitative  changes  in  gene- expression  programs.  This 
topic  has  historically  been  controversial4-9,35.  Our  own  systematic 
study  of  KRT20  and  MUC2  protein  expression  in  human  malignant 
colorectal  cancer  tissues,  for  instance,  revealed  that  both  markers  are 
frequently  expressed  heterogeneously,  in  patterns  that  mirror  those 
observed  in  normal  colorectal  epithelium  (Supplementary  Fig.  15). 
It  remained  unclear,  however,  to  what  extent  cancer  transcriptional 
heterogeneity  is  the  result  of  clonal  genetic  heterogeneity36  or  epige¬ 
netic  heterogeneity  due  to  multilineage  differentiation  processes9. 

To  address  this  question  from  a  functional  perspective,  we  investi¬ 
gated  whether  a  single  (n  =  1)  human  colorectal  cancer  cell  can  recreate 
the  heterogeneous  cell  composition  of  parent  tumor  tissues,  including 
the  subpopulations  that  we  discovered  in  this  study.  We  injected  NOD/ 
SCID/IL2Ry-/_mice  with  single  (w  =1)  EpCAMhigh/CD44+  cancer  cells 
purified  by  flow  cytometry  from  one  of  our  well-characterized  solid 
xenograft  lines37,  following  infection  with  a  lentivirus  vector  encoding 
enhanced  green  fluorescence  protein  (EGFP;  Fig.  3a, b). 

Notably,  the  single  cell-derived,  lentivirus-tagged,  EGFP+  xenograft 
line  generated  in  this  experiment  (UM-COLON#4  clone  8)  closely 
reproduced  the  phenotypic  diversity  of  its  parent  tumor  both  in  terms 
of  tissue  histology  (Figs.  2i,j  and  3f,g)  and  surface-marker  phenotypic 
repertoire  of  cellular  populations  (Fig.  3c, d).  The  lines  monoclonal 
origin  was  confirmed  by  identification  of  a  unique  lentivirus  integra¬ 
tion  site  in  all  cancer  cells  (Fig.  3e). 


Tumorigenicity  experiments  done  in  NOD/SCID/IL2Ry-/-  mice 
revealed  that,  as  observed  in  the  parent  tumors37,  EGFP+/EpCAMhl§h/ 
CD44+  and  EGFP+/EpCAMlow/CD44-/low  cell  populations  were 
endowed  with  different  tumorigenic  capacity  (Fig.  3h).  A  single-cell 
PCR  gene-expression  analysis  of  the  EpCAMhl8h/CD44+  population 
from  these  monoclonal  tumors  demonstrated  its  heterogeneous  line¬ 
age  composition,  showing  the  presence  of  three  distinct  compartments 
(that  is,  LGR5+/ASCL2+ ,  OLFM4+/CA2hi%\  MUC2+/TFF3hiZh ), 
again  characterized  by  distinctive  gene  signatures,  closely  mirroring 
those  observed  in  corresponding  immature  populations  of  normal 
tissues  (Fig.  2f-h). 

Taken  together,  these  data  formally  prove  that,  in  a  subset  of 
tumors,  transcriptional  heterogeneity  is,  at  least  partly,  explained  by 
multilineage  differentiation  processes  that  tend  to  recapitulate  those 
observed  in  normal  tissues. 

Prognostic  role  of  biomarkers  identified  by  single-cell  PCR 

To  gain  further  insight  into  the  potential  functional  implications  of 
these  observations,  we  compared  the  gene-expression  pattern  of  genes 
associated  with  cell  proliferation  (that  is,  MKI67,  TOP2A  and  BIRC5) 
in  normal  and  cancer  tissues.  In  this  case  too,  we  observed  that  the 
expression  pattern  observed  in  malignant  tissues  frequently  mirrored 
that  of  normal  ones. 

Both  in  the  normal  tissue  and  in  the  monoclonal  human  colon 
cancer  xenograft,  for  instance,  all  three  proliferation  markers  were 
frequently  expressed  in  a  mutually  exclusive  way  as  compared  to  the 
differentiation  marker  KRT20  (Supplementary  Fig.  16).  This  obser¬ 
vation  was  subsequently  confirmed  at  the  protein  level  by  a  systematic 
study  of  MKI67  and  KRT20  expression  in  serial  sections  from  seven 
human  colorectal  cancer  tissues,  where  MKI67  expression  was  often 
inversely  associated  with  KRT20  (Supplementary  Fig.  17). 

These  observations  suggest  that,  in  at  least  some  cases,  bulk  short¬ 
term  tumor  growth  is  principally  driven  by  a  specific  subset  of  the 
cancer  cell  population,  characterized  by  a  gene-expression  repertoire 
characteristic  of  more  immature  cell  compartments.  This  concept  has 
important  implications  for  the  modeling  of  tumor  growth  kinetics 
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Figure  4  KRT20and  top-crypt  genes  can  be  used  as  prognostic  markers  in  colorectal  cancer  patients,  (a-d)  We  used  the  Hegemon  software  to  graph 
individual  arrays  according  to  the  expression  levels  of  KRT20  and  one  of  four  genes  characteristic  of  top-of-the-crypt  CA1+/SLC26A3+  enterocyte-like 
cells:  KRT20  versus  CA1  (a),  KRT20  versus  MS4A12{b),  KRT20  versus  CD177  { c),  KR T20  versus  SLC26A3  (d).  We  used  the  StepMiner  algorithm  to 
define  gene-expression  thresholds  and  identify  three  distinct  gene-expression  groups:  Group  1  (green),  defined  as  KRT20+/CAlhigh,  KRT20+/MS4A12hlgh, 
KRT20+/CD1 77+  or  KRT20+/SLC26A3+,  respectively;  Group  2  (blue),  defined  as  KRT20+/CA1~/Iow,  KRT20+/MS4A12~/Iow,  KRT20+/CD177~  or 
KRT20+/SLC26A3-,  respectively;  Group  3  (red),  defined  as  KRT20-/CA1~/Iow,  KRT20~/MS4A12-/Iow,  KRT20~/CD177-  or  KRT20-/SLC26A3~, 
respectively,  (e-h)  Survival  analysis  using  Kaplan-Meier  curves  showed  that,  in  all  four  cases,  an  increasingly  immature  gene-expression  profile 
corresponded  to  a  progressively  worse  prognosis,  (i-l)  Multivariate  analysis  of  survival  data  based  on  the  Cox  proportional  hazards  model  indicated 
that  the  prognostic  effect  of  these  two-gene  classifiers  was  not  confounded  by  clinical  stage,  age  or  sex.  The  analysis  was  performed  on  a  pooled 
database  of  299  primary  colon  cancer  gene-expression  arrays  annotated  with  disease-free  survival  (DFS)  data41’42  (Supplementary  Table  1).  *P<  0.05, 
**P<  0.001.  Age  modeled  as  a  continuous  variable.  HR,  hazard  ratio;  Cl,  confidence  interval;  M,  male;  F,  female. 


and  the  response  to  anti-tumor  drugs  in  different  experimental  set¬ 
tings.  Although  very  common,  this  feature  is  not  absolute,  as  we  have 
0{k  observed  exceptions  characterized  either  by  homogenous  expression 
of  KRT20  in  almost  the  entirety  of  the  malignant  epithelium  or  by 
complete  absence  of  it  in  selected  human  tumors  (Supplementary 
Fig.  17,  samples  SU87  and  SU98,  respectively).  In  accordance  with 
our  model,  tumors  characterized  by  the  complete  absence  of  KRT20 
expression  were  very  poorly  differentiated  and  contained  high  per¬ 
centages  ofMKI67+  cells  (Supplementary  Fig.  17,  SU98). 

We  next  tested  whether  these  insights  in  the  functional  anatomy 
of  the  colon  epithelium  could  have  clinically  useful  applications. 
We  evaluated  whether  quantitative  expression  levels  of  genes  asso¬ 
ciated  with  differentiation  processes  could  be  used  as  a  substitute 
measure  for  the  cellular  composition  of  the  corresponding  tumors 
and  thereby  serve  to  stratify  colon  cancer  patients  and  predict 
clinical  outcome.  Our  single-cell  PCR  gene-expression  analysis 
data  identified  a  set  of  sensitive  and  exclusive  markers  of  top-of- 
the-crypt  CA 1 +/SL C26A3+  cells  (that  is,  CAI,  MS4A12 ,  CD177 , 
SLC26A3).  It  also  implicated  KRT20  as  a  more  promiscuous  differ¬ 
entiation  marker,  whose  expression  is  high  in  CA1+/SLC26A3+  cells 
and  a  subset  of  MUC2+/TFF3hi8h  cells,  is  absent  in  LGR5+/ASCL2+ 
cells,  and  is  inversely  associated  with  that  of  proliferation  markers 
( MKI67 ,  TOP2A ,  BIRC5).  In  addition,  KRT20  expression  can  be 
easily  detected  by  immunohistochemistry  and  is  commonly  used 


as  a  diagnostic  marker  in  surgical  pathology38,  thus  representing 
an  attractive  candidate  for  further  clinical  applications39. 

Our  first  analysis  of  a  pool  of  1,568  independent  human  colon 
gene- expression  arrays  revealed  that  expression  levels  of  genes  charac¬ 
teristic  for  the  CA1+/SLC26A3+  cell  population  are  strongly  correlated 
(Supplementary  Fig.  18).  The  relationship  between  the  expression  of 
these  top-of-the-crypt  genes  and  KRT20  was  described  by  a  Boolean 
implication:  tumors  expressing  high  levels  of  top-of-the-crypt  genes 
( top-crypthi£h )  were  always  KRT20+ ,  whereas  tumors  expressing 
low-to-negative  levels  of  top-of-the-crypt  genes  (top-crypt-/low) 
could  be  clearly  separated  into  two  groups:  KRT20+  and  KRT20~ 
(Supplementary  Fig.  7).  Importantly,  KRT20~  tumors  expressed  high 
levels  of  ALCAM/CD166  (Supplementary  Fig.  19),  a  gene  encoding 
for  a  surface  marker  characteristic  of  colon  cancer  cells  with  high 
tumorigenic  potential  in  mouse  xenotransplantation  experiments37. 

We  developed  software  (‘hierarchical  exploration  of  gene  expres¬ 
sion  microarrays  on-line’,  or  Hegemon)  to  analyze  the  survival 
outcomes  of  human  colon  cancer  patients  after  stratification  into 
distinct  gene-expression  subsets,  based  on  the  expression  of  KRT20 
and  one  of  the  marker  genes  of  CA1+/SLC26A3+  top-of-the-crypt 
cells  (Fig.  4a-d).  These  subsets,  or  gene-expression  groups,  were 
numbered  from  more  to  less  mature  (group  1,  KRT20+/top-crypthi%h; 
group  2,  KRT20+/top-crypr/low;  group  3,  KRT20~ / top -cry pt~/low).  We 
used  a  computer-assisted  method  to  determine  the  threshold  level 
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Table  1  The  prognostic  effect  of  KRT20/top-crypt  gene-expression 
groups 


HRa 

95%  Clb 

Pvalue 

KRT20/CA1 

Prognostic  variable 

Group  (1-3)  krt20/ca1 

2.93 

1.37-6.27 

0.0056* 

Grade  (G1-G4) 

1.09 

0.58-2.04 

0.80 

Stage  (l-IV) 

3.43 

2.20-5.34 

<0.001** 

Agec 

0.99 

0.97-1.01 

0.43 

Sex  (M/F)d 

1.18 

0.86-1.61 

0.31 

KRT20/MS4A12 

Prognostic  variable 

Group  (1-3)  krt20/ms4a12 

2.93 

1.37-6.28 

0.0057* 

Grade  (G1-G4) 

1.07 

0.57-2.00 

0.84 

Stage  (l-IV) 

3.41 

2.19-5.31 

<0.001** 

Agec 

0.99 

0.97-1.01 

0.41 

Sex  (M/F)d 

1.19 

0.87-1.63 

0.28 

KRT20/CD177 

Prognostic  variable 

Group  (1-3)  krt20/cd177 

1.94 

0.97-3.90 

0.062 

Grade  (G1-G4) 

1.19 

0.63-2.22 

0.59 

Stage  (l-IV) 

3.21 

3.03-7.06 

<0.001** 

Agec 

0.99 

0.97-1.01 

0.39 

Sex  (M/F)d 

1.20 

0.87-1.64 

0.26 

KRT20/SLC26A3 

Prognostic  variable 

Group  (1-3)  krt20/slc26a3 

2.36 

1.14-4.88 

0.021* 

Grade  (G1-G4) 

1.12 

0.60-2.10 

0.72 

Stage  (l-IV) 

3.34 

2.16-5.15 

<0.001** 

Agec 

0.99 

0.97-1.01 

0.45 

Sex  (M/F)d 

1.19 

0.87-1.63 

0.27 

Multivariate  analysis  based  on  the  Cox  proportional  hazards  model,  testing  the 
KRT20/top-crypt  two-gene  scoring  systems  in  parallel  with  pathological  grading,  clinical 
stage,  age  and  sex,  using  the  KRT20/CA1  two-gene  classifier,  the  KRT20/MS4A12 
two-gene  classifier,  the  KRT20/CD177  two-gene  classifier  or  the  KRT20/  SLC26A3  two- 
gene  classifier.  Contrary  to  pathological  grade,  KRT20/top-crypt  gene  expression  groups 
were  associated  with  statistically  significant  (p  <  0.05)  hazard  ratios  (HR),  with  the 
only  exception  of  the  KRT20/CD177  two-gene  classifier.  The  analysis  was  performed  on 
a  subset  database  of  181  microarrays  annotated  with  grading  information  (database 
from  ref.  42,  n  =  181,  see  Supplementary  Table  1).  *,  P<  0.05;  **,  P  <  0.001. 
aHR,  hazard-ratio.  bCI,  confidence  interval.  cAge  modeled  as  a  continuous  variable.  dM/F, 
male  versus  female. 


between  positive  and  negative  expression,  based  on  the  StepMiner 
algorithm  (Supplementary  Fig.  20)40,  and  compared  the  clinical 
outcome  of  colon  cancer  patients  in  the  three  groups,  using  a  pool 
of  three  independent  data  sets,  containing  299  patients  at  different 
clinical  stages  (either  AJCC  stage  I-IV  or  Dukes  stage  A-D)  from 
the  H.  Lee  Moffit  Cancer  Center,  the  Vanderbilt  Medical  Center  and 
the  Royal  Melbourne  Hospital41,42,  all  of  which  were  annotated  with 
disease-free  survival  (DFS)  data. 

The  three  patient  groups  identified  by  these  simple  two-gene 
classifiers  displayed  substantially  different  clinical  outcomes.  An 
increasingly  immature  gene- expression  profile  corresponded  to  a  pro¬ 
gressively  worse  prognosis  (Fig.  4e-h).  This  result  was  independent 
of  the  gene  chosen  as  marker  of  CA1+/SLC26A3+  cells  (that  is,  CA1, 
MS4A12 ,  CD  177,  SLC26A3)  and  a  multivariate  analysis  indicated  that 
the  prognostic  value  of  the  two-gene  grouping  system  was  not  con¬ 
founded  by  stage  or  other  clinical  variables  (Fig.  4i-l). 

Tumors  with  a  more  immature  gene-expression  profile  (group  3, 
KRT20~/top-crypt~/low)  were  more  likely  to  be  of  high  pathological 
grade  (G3-G4;  Supplementary  Fig.  21)  and  of  microsatellite  instability 
status  (MSI;  Supplementary  Fig.  22).  These  enrichments,  however, 
did  not  confound  the  prognostic  value  of  the  two-gene  classifier 
system,  as  the  high  hazard-ratios  associated  with  more  immature  gene- 
expression  groups  remained  statistically  significant  (P  <  0.05),  when 
tested  against  pathological  grade  in  multivariate  analysis  (Table  1; 


with  the  exception  of  KRT20/CD177 ,  P  =  0.06),  and  because  MSI+ 
tumors  are  known  to  be  usually  associated  with  a  better  prognosis43. 
The  prognostic  effect  of  the  two-gene  classifier  system  was  also  inde¬ 
pendent  of  the  recently  described  multigene  EphB2  intestinal  stem 
cell  signature19,  and  was  associated  with  comparable,  if  not  superior, 
hazard  ratios  (Supplementary  Fig.  23). 

DISCUSSION 

In  this  study,  we  implemented  a  method  to  investigate  the  cellular 
composition  of  solid  tissues  based  on  high-throughput  parallel  ana¬ 
lysis  of  the  gene-expression  repertoire  of  single  cells  sorted  by  flow 
cytometry.  We  used  this  methodology  to  identify  distinct  cellular  sub¬ 
sets  of  the  human  colon  epithelium  and  to  discover  gene  expression 
markers  to  define  them.  We  then  examined  human  colorectal  tumors, 
both  benign  and  malignant,  and  characterized  them  in  terms  of  cell 
lineage  composition  and  maturation.  We  showed  that  tumor  tissues 
contain  multiple  cell  types  whose  transcriptional  identities  mirror 
those  of  the  cellular  lineages  of  the  normal  epithelium.  Moreover,  we 
showed  that  tumor  tissues  generated  from  a  single  cell  can  recapitulate 
the  lineage  diversity  of  parent  tumors,  demonstrating  that  multiline¬ 
age  differentiation  represents  a  key  source  of  in  vivo  functional  and 
phenotypic  cancer  cell  heterogeneity. 

Using  these  concepts  as  a  guide,  we  identified  biological  subsets  of 
human  colorectal  cancer,  based  on  the  expression  of  genes  charac¬ 
teristic  of  specific  cell  types.  These  biological  subsets  were  associated 
with  substantially  different  clinical  outcomes  and  could  be  identified 
by  a  simple  two-gene  classifier  system.  This  prognostic  scoring  sys¬ 
tem  appeared  independent  of  and  superior  to  pathological  grading, 
which  is,  to  this  date,  one  of  the  few  parameters  incorporated  into  the 
design  of  therapeutic  algorithms  for  colon  cancer  patients44.  Owing 
to  its  simplicity  and  quantitative  nature,  this  two -gene  scoring  system 
has  the  potential  to  move  beyond  the  realm  of  purely  experimental 
medicine  and  is  a  viable  candidate  for  clinical  applications. 

METHODS 

Methods  and  any  associated  references  are  available  in  the  online 
version  of  the  paper  at  http://www.nature.com/naturebiotechnology/. 

Note:  Supplementary  information  is  available  on  the  Nature  Biotechnology  website. 
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ONLINE  METHODS 

Human  primary  tissues  and  colon  cancer  xenografts.  Human  primary 
colon  tissues,  normal  or  malignant,  were  collected  according  to  guidelines 
from  Stanford  University’s  institutional  review  board.  Human  colon  cancer 
tissues  used  in  this  study,  either  from  primary  samples  or  xenograft  lines,  are 
listed  in  Supplementary  Table  4,  together  with  clinical  information  relative 
to  corresponding  patients.  Human  colon  cancer  xenograft  lines  were  estab¬ 
lished  and  serially  passaged  in  immunodeficient  mice  following  previously 
published  protocols37.  A  detailed  description  of  these  protocols  is  provided 
in  the  Supplementary  Methods. 

Cell  lines.  Calibration  experiments  to  measure  accuracy  and  precision  of 
single-cell  sorting  by  flow  cytometry,  as  well  as  to  measure  single-cell  sensi¬ 
tivity  of  single-cell  PCR,  were  performed  on  a  clone  of  the  HCT116  human 
colon  cancer  cell  line  infected  with  the  pLL3.7  lentivirus  (Addgene  no.  11795). 
HCT1 16  cells  are  available  from  the  American  Tissue-type  Culture  Collection 
(ATCC;  CCL-247)  and  were  maintained  in  RPMI-1640  medium,  supplemented 
with  10%  heat-inactivated  fetal  bovine  serum,  2  mM  L-glutamine,  120  jig/ml 
penicillin,  100  pg/ml  streptomycin,  20  mM  HEPES  and  1  mM  sodium  pyru¬ 
vate,  as  previously  described45. 

Solid  tissue  disaggregation.  Solid  tissues,  normal  and  neoplastic,  collected 
from  primary  surgical  specimens  or  mouse  xenografts,  were  mechanically 
and  enzymatically  disaggregated  into  single-cell  suspensions,  following  pre¬ 
viously  published  protocols37.  Briefly,  solid  tissues  were  minced  into  small 
chunks  (2  mm3),  rinsed  with  Hank’s  balanced  salt  solution  (HBSS),  finely 
chopped  with  a  razor  blade  into  minute  fragments  (0.2-0.5  mm3),  resuspended 
in  serum-free  RPMI-1640  medium  (2  mM  L-glutamine,  120  Jig/ml  penicillin, 
100  Rg/ml  streptomycin,  50  jlg/ml  ceftazidime,  0.25  jlg/ml  amphotericin-B, 
20  mM  HEPES,  1  mM  sodium  pyruvate)  together  with  100  units/ml  DNase-I 
and  200  units/ml  Collagenase-III  (Worthington)  and  incubated  for  2  h  at 
37  °C  to  obtain  enzymatic  disaggregation.  Cell  suspensions  were  serially 
filtered  with  sterile  gauze,  70-jim  and  40-pm  nylon  meshes.  Red  blood  cells 
were  removed  by  osmotic  lysis  with  ACK  hypotonic  buffer  (150  mM  NH4C1, 
1  mM  KHC03;  5  min  on  ice). 

Flow  cytometry  and  single-cell  sorting  experiments.  To  minimize  loss  of 
cell  viability,  we  performed  experiments  on  fresh  cell  suspensions,  prepared 
shortly  before  flow  cytometry37.  Antibody  staining  was  performed  in  HBSS 
supplemented  with  2%  heat-inactivated  calf  serum,  120  pg/ml  penicillin, 
100  jlg/ml  streptomycin,  50  |lg/ml  ceftazidime,  0.25  Jig/ml  amphotericin-B, 
20  mM  HEPES,  1  mM  sodium  pyruvate  and  5  mM  EDTA.  To  minimize 
unspecific  binding  of  antibodies,  cells  were  first  incubated  with  0.6%  human 
IgGs  (Gammagard  Liquid;  Baxter)  for  10  min  on  ice,  at  a  concentration  of 
3-5  x  105  cells/100  pi.  Cells  were  subsequently  washed  and  stained  with 
antibodies  at  dilutions  determined  by  appropriate  titration  experiments. 
Antibodies  used  in  this  study  include  anti-human  EpCAM-FITC  or  PE 
(clone  EBA-1;  BD  Biosciences),  anti-human  CD44-APC  (clone  G44-26; 
BD  Biosciences)  and  anti-human  CD66a-PE  (clone  283340;  R&D  Systems). 
Cells  positive  for  expression  of  nonepithelial  lineage  markers  (Lin+)  were 
excluded  by  staining  with  PE.Cy5-labeled  antibodies  using  different  strate¬ 
gies  for  primary  tissues  and  mouse  xenografts.  In  experiments  on  primary 
human  tissues,  stromal  cells  were  excluded  by  staining  with  anti-human 
CD3-biotin  (clone  UCHT1;  BD  Biosciences),  CD16-biotin  (clone  3G8;  BD 
Biosciences),  CD45-biotin  (clone  HI30;  BD  Biosciences),  and  CD64-biotin 
(clone  10.1;  BD  Biosciences)  +  streptavidin-PE/Cy5  (BD  Biosciences).  In 
experiments  on  human  colon  cancer  xenografts,  mouse  cells  were  excluded 
by  staining  with  anti-mouse  CD45-PE/Cy5  (clone  30-F11;  BD  Biosciences) 
and  anti-mouse  H-2Kd-biotin  (clone  SF  1  —  1.1;  BD  Biosciences)  + 
streptavidin-PE/Cy5  (BD  Biosciences).  After  15  min  on  ice,  stained  cells 
were  washed  of  excess  unbound  antibodies  and  resuspended  in  HBSS  with 
2%  heat-inactivated  calf  serum,  20  mM  HEPES,  5  mM  EDTA,  1  mM  sodium 
pyruvate  and  1.1  pM  DAPI  dilactate  (Molecular  Probes).  Flow-cytometry 
analysis  was  performed  using  a  BD  FACSAriall  cell-sorter  (Becton 
Dickinson).  Forward-scatter  height  versus  forward-scatter  width  (FSC-H 
versus  FSC-W)  and  side-scatter  height  versus  side-scatter  width  (SSC-H 
versus  SSC-W)  profiles  were  used  to  eliminate  cell  doublets.  Dead  cells 


were  eliminated  by  excluding  DAPI+  cells,  whereas  contaminating  human 
or  mouse  Lin+  cells  were  eliminated  by  excluding  PE/Cy5+  cells. 

In  single-cell  sorting  experiments,  each  single  (n  =  1)  cell  was  individually 
sorted  into  a  different  well  of  a  96-well  PCR  plate,  using  a  protocol  already 
built-in  within  the  FACSAriall  software  package,  with  appropriate  adjustments 
(device:  96-well  plate;  precision:  single-cell;  nozzle:  130  pm). 

Single-cell  PCR.  Single-cell  gene-expression  experiments  were  performed 
using  Fluidigm’s  M96  quantitative  PCR  (qPCR)  DynamicArray  microfluidic 
chips  (Fluidigm).  Single  cells  were  sorted  by  FACS  into  individual  wells  of 
96-well  PCR  plates  as  described  above.  Each  96-well  plate  was  preloaded 
with  5  pl/well  of  CellsDirect  PCR  mix  (Invitrogen)  and  0.1  pl/well  (2  U) 
of  Superaseln  RNase- inhibitor.  Following  single-cell  sorting,  each  well  was 
supplemented  with  1  pi  (Applied  Biosystems)  of  SuperScript-III  RT/Platinum 
Taq  (Invitrogen),  1.5  pi  of  Tris-EDTA  (TE)  buffer  and  2.5  pi  of  a  mixture  of 
96  pooled  TaqMan  assays  (Applied  Biosystems)  containing  each  assay  at 
1:100  dilution.  Single-cell  mRNA  was  directly  reverse  transcribed  into  cDNA 
(50  °C  for  15  min,  95  °C  for  2  min),  pre-amplified  for  20  cycles  (each  cycle: 
95  °C  for  15  s,  60  °C  for  4  min)  and  diluted  1:3  with  TE  buffer.  A  2.25  pi  aliquot 
of  amplified  cDNA  was  then  mixed  with  2.5  pi  of  TaqMan  Universal  PCR 
Master  Mix  (Applied  Biosystems)  and  0.25  pi  of  Fluidigm’s  “sample  loading 
agent,”  then  inserted  into  one  of  the  chip  “sample”  inlets.  Individual  TaqMan 
assays  were  diluted  at  1:1  ratios  with  TE.  A  2.5  pi  aliquot  of  each  diluted 
TaqMan  assay  was  mixed  with  2.5  pi  of  Fluidigm’s  “assay  loading  agent”  and 
individually  inserted  into  one  of  the  chip  “assay”  inlets.  Samples  and  probes 
were  loaded  into  M96  chips  using  an  IFC  Controller  HX  (Fluidigm),  then 
transferred  to  a  BioMark  real-time  PCR  reader  (Fluidigm)  following  manu¬ 
facturer’s  instructions.  A  list  of  the  57  TaqMan  assays  used  in  this  study  is 
provided  in  Supplementary  Table  2. 

Analysis  and  graphic  display  of  single-cell  PCR  data.  Single-cell  PCR  data 
were  analyzed  and  displayed  using  MATLAB  (Math Works)  as  summarized 
in  Supplementary  Figure  2.  A  minimum  of  336  cells  were  analyzed  for  each 
phenotypic  population,  corresponding  to  four  PCR  plates,  each  contain¬ 
ing  84  single  cells  (84  x  4  =  336),  eight  positive  and  four  negative  controls. 
As  positive  controls,  we  used  replicates  of  a  1 : 1 : 1  mixture  of  total  RNA  stand¬ 
ards  from  human  normal  colon  (AM7986),  human  normal  testes  (AM7972) 
and  HeLa  cells  (AM7852),  all  from  Applied  Biosystems.  Results  from  cells 
not  expressing  ACTB  ((3-actin)  and  GAPDH  (glyceraldehyde  3-phosphate 
dehydrogenase),  or  expressing  them  at  extremely  low  values  (Ct  >35),  were 
removed  from  the  analysis.  Gene-expression  results  were  normalized  by  mean 
centering  and  dividing  by  3  times  the  standard  deviation  (3  s.d.)  of  expressing 
cells  (Supplementary  Fig.  2),  and  visualized  using  both  hierarchical  cluster¬ 
ing  and  PCA12’46.  Hierarchical  clustering  was  performed  both  on  cells  and 
genes,  based  on  Euclidean  or  correlation  distance  metric  and  complete  linkage. 
Positive  or  negative  associations  between  two  genes  were  tested  by  Spearman 
correlation,  and  P-  values  calculated  based  on  10,000  permutations.  Both  hier¬ 
archical  clustering  and  PCA  were  based  on  the  results  for  47  differentially 
expressed  genes  (51  assays),  and  excluded  results  from  housekeeping  (ACTB, 
GAPDH,  EpCAM )  and  proliferation-related  genes  ( MKI67 ,  TOP2A,  BIRC )  to 
avoid  noise  based  on  proliferation  status.  A  detailed  description  of  all  these 
procedures  is  provided  in  the  Supplementary  Methods. 

Immunohistochemistry  and  immunofluorescence.  Paraffin-embedded 
tissue  sections  were  stained  with  anti-human  CK20  (clone  Ks20.8, 
DakoCytomation),  MUC2  (clone  Ccp58,  Fitzgerald  Industries),  MKI67 
(clone  MIB-1,  DakoCytomation),  CEACAMl/CD66a  (clone  283340;  R&D 
Systems)  and  SLC26A3  (lot  no.  R32905,  Sigma  Life  Science)  antibodies, 
according  to  manufacturers’  instructions.  Frozen  tissue  sections  were  stained 
with  an  anti-human  CD  177  antibody  (clone  MEM- 166,  BD  Biosciences)  fol¬ 
lowed  by  secondary  staining  with  goat  anti-mouse  IgG-Alexa488  (Invitrogen). 
A  description  of  immunohistochemistry  and  immunofluorescence  protocols 
is  provided  in  the  Supplementary  Methods. 

Generation  and  characterization  of  monoclonal  tumors.  EpCAMhl§h/CD44+ 
human  colon  cancer  cells  were  infected  with  the  pLL3.7  lentivirus  (Addgene 
#1 1795)47.  Cells  were  infected  by  spin-inoculation  for  4  h  and  injected  in  bulk 
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into  the  subcutaneous  tissue  of  a  NOD/SCID/IL2RY-7-  mice.  The  resulting 
tumors  were  analyzed  to  evaluate  infection  efficiency,  and  EGFP+/EpCAMhl§h/ 
CD44+  were  re-sorted  and  injected  as  single  cells,  again  into  NOD/SCID/ 
IL2Ry_/_  mice.  Monoclonal  origin  of  tumors  originated  from  single  {n  =  1) 
lentivirus-infected  EpCAMhl§h/CD44+  cancer  cells  was  confirmed  by  ligation- 
mediated  PCR  (LM-PCR)48,  followed  by  DNA  sequencing  of  LM-PCR  ampli¬ 
fication  products.  In  the  case  of  UM-COLON#4  clone  8,  DNA  sequencing  of 
LM-PCR  amplification  products  pinpointed  the  provirus  integration- site  on 
the  long  arm  of  human  chromosome  19  (19ql3.3),in  proximity  of  the  AP3D1 
gene  (adaptor- related  protein  complex  3,  delta  1  subunit).  For  a  visual  guide  on 
how  to  interpret  LM-PCR  results  refer  to  Supplementary  Figure  24. 

Tumorigenicity  experiments.  Tumorigenicity  experiments  were  performed 
in  NOD/SCID/IL2Ry_/_  immunodeficient  mice  following  previously  pub¬ 
lished  protocols37,49,50  and  Stanford  University’s  institutional  animal  welfare 
guidelines.  Tumorigenic  cell  frequencies  were  calculated  by  limiting  dilution 
using  the  L-Calc  software  (StemCell  Technologies).  A  detailed  description  of  the 
protocols  used  for  tumorigenicity  experiments  is  provided  in  the 
Supplementary  Methods. 

Bioinformatic  data  collection  and  assemblage  of  the  “human  colon  global 
database.”  A  collection  of  46,047  publicly  available  human  gene-expression 
arrays  (25,721  arrays  on  Affymetrix  U133  Plus  2.0, 16,357  arrays  on  Affymetrix 
U133A,  3,969  arrays  on  Affymetrix  U 133 A  2.0)  was  downloaded  from  NCBI’s 
GEO  database  and  normalized  using  the  RMA  (Robust  Multi-chip  Average) 
algorithm.  Normalization  was  done  either  independently  for  each  platform 
or  on  the  whole  array  collection,  using  a  modified  CDF  (chip  description  file) 
reduced  to  contain  only  shared  probes.  From  this  general  collection,  which 
contained  arrays  from  all  types  of  human  samples,  we  extracted  a  subset  of 
1,684  unique  arrays  from  human  colon  tissues,  either  normal  or  cancerous. 
We  named  this  subset  the  “human  colon  global  database,”  and  we  annotated 
all  its  samples  as  normal  colon  ( n  =  173),  benign  colonic  adenoma  ( n  =  68)  or 
colorectal  cancer  ( n  =  1443).  To  avoid  redundancies  (that  is,  identical  samples 
deposited  two  or  more  times  in  independent  GEO  data  sets)  we  cross-checked 
all  samples  and  removed  duplicates.  When  available,  we  collected  all  available 
clinical,  pathological  and  molecular  information  related  to  the  corresponding 
patients.  As  not  all  arrays  were  annotated  for  all  variables,  individual  hypoth¬ 
eses  were  tested  on  specific  subsets  of  the  “human  colon  global  database.” 
A  list  of  all  GEO  data  sets  used  in  this  study,  and  of  their  contribution  to  dif¬ 
ferent  experiments,  is  provided  in  Supplementary  Table  1. 

Mining  of  gene-expression  arrays  using  Boolean  implications.  Gene-expression 
thresholds  between  positive  and  negative  samples  were  defined  using  the 
StepMiner  algorithm40,  and  Boolean  implication  relationships  between  pairs 
U  of  genes  using  the  BooleanNet  software20.  Briefly,  for  each  gene,  individual 
samples  were  ordered  from  low-to-high  based  on  their  gene-expression  values, 
and  a  rising  step  function  was  fit  to  the  data,  trying  to  minimize  differences 
between  fitted  and  measured  values.  This  method  identifies  a  “step”  at  the 
point  of  largest  jump  from  low  to  high  values  (but  only  if  a  sufficient  number 
of  gene-expression  values  is  present  on  each  side  of  the  jump  to  exclude  a 
random  oscillation  due  to  noise)  and  sets  the  gene- expression  threshold  at  the 
value  corresponding  to  the  step40.  An  intermediate  region  is  defined  around 
the  threshold,  with  a  width  of  1  (threshold  +/-0.5),  corresponding  to  a  twofold 
change  in  expression  levels,  which  represents  the  minimum  noise  in  these  data 
sets20,40.  All  samples  below  the  intermediate  region  (<  1st  StepMiner  threshold 
-  0.5)  are  considered  negative,  and  all  samples  above  the  intermediate  region 
(>  1st  StepMiner  threshold  +  0.5)  are  considered  positive.  When  gene-expression 
levels  display  a  large  dynamic  range,  the  StepMiner  algorithm  can  be  used  to 
calculate  two  distinct  thresholds:  a  first  threshold  to  discriminate  between 
“negative”  and  “positive”  samples  (1st  StepMiner  threshold)  and  a  second 
threshold  to  split  “positive”  samples  into  two  subgroups  with  “low”  and  “high” 
gene-expression  (2nd  StepMiner  threshold;  Supplementary  Fig.  20). 

We  started  our  search  for  developmentally  regulated  genes  on  the  “human 
colon  global  database”  (Supplementary  Table  1).  To  minimize  the  risk  of 
results  being  affected  by  samples  containing  substantial  contaminations 
from  tissues  other  than  colorectal  epithelium  (e.g.,  normal  liver  tissue  in 
hepatic  metastases),  we  restricted  our  investigation  to  the  subset  of  arrays 


with  an  EpCAM+/albumin~  gene-expression  profile  (Supplementary  Fig.  6). 
Threshold  gene-expression  levels  were  calculated  using  the  StepMiner 
algorithm,  based  on  the  1,684  arrays  of  the  “human  colon  global  database” 
(EpCAM+:  Affymetrix  probe  201839_s_at  >10.05;  albumin Affymetrix 
probe  21 1298_s_at  <7.97).  This  operation  removed  1 16  arrays  (6.9%)  and  left 
1,568  arrays  (93.1%)  for  analysis  (normal  colon:  n  =  170;  colorectal  adenoma: 
n  =  68;  colorectal  carcinoma:  n  =  1,330). 

Boolean  implication  relationships  between  pairs  of  genes  were  systemati¬ 
cally  computed  using  the  BooleanNet  software20.  Mature  enterocyte  genes  were 
predicted  as  genes  highly  expressed  in  KRT20+  arrays  and  filtered  based  on  the 
fulfillment  of  the  “X+  implies  KRT20+”  Boolean  implication  (Supplementary 
Fig.  7).  Goblet  genes  were  predicted  as  genes  highly  expressed  in  MUC2+ 
arrays  and  filtered  based  on  the  fulfillment  of  at  least  one  of  three  independent 
Boolean  implications:  “ MUC2  is  equivalent  toX”,  “X+  implies  MUC2+ ”,  “MUC2+ 
implies  X+”  (Supplementary  Fig.  8).  Immature  genes  were  predicted  as  genes 
highly  expressed  in  KRT20~  arrays,  and  filtered  based  on  the  fulfillment  of  the 
“KRT20~  implies  X+”  Boolean  implication  (Supplementary  Fig.  9).  Threshold 
gene-expression  levels  were  calculated  using  the  StepMiner  algorithm,  based  on 
the  global  collection  of  46,047  human  arrays.  Gene-expression  patterns  were 
considered  to  fulfill  a  Boolean  implication  when  the  false-discovery  rate  (FDR) 
of  a  sparsity  test  in  the  relevant  quadrant  was  <0.05  (ref.  20). 

Differences  in  gene-expression  levels  among  different  sample  groups  (e.g., 
normal  versus  adenoma)  were  evaluated  using  box  plots  and  tested  for  statistical 
significance  using  a  2-sample  f-test  (2-tailed).  Correlations  between  two  genes’ 
expression  levels  were  measured  using  Pearson  correlation  coefficients. 

Stratification  of  human  colon  cancer  patients  in  distinct  gene-expression 
groups.  Associations  between  gene-expression  profiles  and  patient  survival 
were  investigated  using  a  new  bioinformatics  tool,  named  Hegemon.  Hegemon 
is  an  upgrade  of  the  BooleanNet  software,  where  individual  gene-expression 
arrays,  after  being  plotted  on  a  two-axis  chart  based  on  the  expression  of  two 
given  genes20,  can  be  grouped  and  compared  for  survival  outcomes,  using  both 
Kaplan-Meier  curves  and  multivariate  analysis  based  on  the  Cox  proportional 
hazards  method. 

Survival  analysis  was  done  on  a  gene-expression  database  annotated  with 
disease-free  survival  (DFS)  information  on  299  patients  from  three  institu¬ 
tions:  H.  Lee  Moffit  Cancer  Center  ( n  =  164),  Vanderbilt  Medical  Center 
( n  =  55)  and  Royal  Melbourne  Hospital  ( n  =  80).  This  database  was  created 
by  pooling  information  from  two  publicly  available  and  partially  redundant 
GEO  data  sets  (GSE14333,  GSE17538;  Supplementary  Table  l)41,42,  both 
collected  on  Affymetrix  U133  Plus  2.0.  To  avoid  bias  due  to  redundancies 
(that  is,  identical  samples  deposited  in  both  GEO  data  sets),  we  cross-checked 
all  samples  and  removed  duplicates. 

Guided  by  single-cell  PCR  results,  we  chose  to  stratify  patients  using  four 
genes  characteristic  of  top-of-the-crypt  CA1+/SLC26A3+  cells  ( CA1 ,  MS4A12, 
CD  177,  SLC26A3)  as  markers  of  terminal  differentiation,  and  using  KRT20, 
whose  expression  is  observed  in  both  top-of-the-crypt  CA1 +/SLC26A3+  cells  and 
a  subset  of  MUC2+/TFF3hl8h  goblet-type  cells,  as  a  more  promiscuous  marker 
of  both  intermediate  and  terminal  differentiation.  The  hypothesis  behind  this 
approach  was  that,  on  average,  a  tumor’s  overall  gene-expression  profile  would 
most  closely  resemble  that  of  the  most  abundant  cell  population.  Thus,  tumors 
highly  enriched  in  mature,  terminally  differentiated  cell  types  would  be  char¬ 
acterized  by  a  lower  proliferation  rate  and/or  a  lower  content  of  long-term  self- 
renewing  cells,  and  be  associated  with  a  better  prognosis  as  compared  to  tumors 
predominantly  composed  by  immature,  progenitor-like  cells. 

Threshold  gene-expression  levels  were  calculated  using  the  StepMiner 
algorithm,  based  on  the  25,576  arrays  on  Affymetrix  U133  Plus  2.0.  KRT20 
expression  (Affymetrix  probe  213953_at)  was  tested  as  a  marker  to  separate 
poorly  differentiated  tumors  ( KRT20~ )  from  differentiated  ones  ( KRT20+ ). 
Based  on  our  previous  experience40,  we  defined  as  KRT20~  all  tumors  whose 
KRT20  expression  values  were  <  1st  StepMiner  threshold  -  0.5  (Affymetrix 
probe  21 3953_at  <  7.00).  Genes  expressed  in  top-of-the-crypt  CA1+/SLC26A3+ 
cells  ( CA1 ,  MS4A12,  CD177,  SLC26A3 )  were  tested  as  markers  to  separate 
terminally  differentiated  tumors  ( top-crypthl2h )  from  moderately  differentiated 
ones  ( top-crypt~/low ).  In  the  case  of  CD177  (Affymetrix  probe  219669_at)  and 
SLC26A3  (Affymetrix  probes  215657_at),  the  sensitivity  of  the  probe  appeared 
lower,  and  its  dynamic  range  narrower,  as  compared  to  CA1  (Affymetrix  probe 
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205950_s_at)  or  MS4A12  (Affymetrix probe  220834_at)  (Supplementary  Fig.  7). 
To  maintain  consistency  in  grouping  samples  with  the  highest  expression  levels, 
we  adopted  a  scaled  approach  designed  to  match  the  different  sensitivity  of 
individual  gene-expression  probes  (Supplementary  Fig.  20).  In  the  case  of 
CD  177  and  SLC26A3,  we  chose  to  separate  negative  samples  from  positive  ones 
( CD1 77~  versus  CD1 77+,  SLC26A3~  versus  SLC26A3+ ),  whereas  in  the  case  of 
CA1  and  MS4A12  we  chose  to  separate  low-to- negative  expression  samples 
from  high  expression  ones  ( CAl~/low  versus  CAlhl8h,  MS4A12~/low  versus 
MS4A12hlSh).  As  a  result,  when  we  tested  CD  177  or  SLC26A3  we  defined 
as  top-crypthl8h  all  tumors  that  scored  as  CD177+  or  SLC26A3+,  defined  as 
expression  values  >  1st  StepMiner  threshold  +  0.5  (CD177:  Affymetrix  probe 
219669_at  >  8.14;  SLC26A3 :  Affymetrix  probe  215657_at  >  5.43),  and  when 
we  tested  CA1  or  MS4A12  we  defined  as  top-crypthl8h  all  tumors  that  scored  as 
CAlhlSh  or  MS4A12hl8hy  defined  as  expression  values  >  2nd  StepMiner  thresh¬ 
old  {CA1:  Affymetrix  probe  205950_s_at  >  11.14;  MS4A12:  Affymetrix  probe 
220834_at  >  9.27). 

Based  on  these  definitions,  we  stratified  colon  tumors  into  three  “gene-expres¬ 
sion  groups”:  Group  1  (KRT20+/top-crypthlSh),  Group  2  ( KRT20+/top-crypt~/low ), 
Group  3  ( KRT20~/top-crypt~/low ).  As  predicted  by  the  strong  Boolean  relation¬ 
ship  linking  KRT20  to  all  mature  enterocyte  genes  (Supplementary  Fig.  7), 
no  tumors  were  observed  that  corresponded  to  the  theoretical  fourth  group 
(KRT20~/top-crypthl8h),  with  the  only  exception  of  one  isolated  sample  in  the 
KRT20/SLC26A3  experiment.  In  experiments  involving  comparisons  to  the 
EphB2+  “intestinal  stem  cell”  (EphB2-ISC)  signature  (Supplementary  Fig.  23), 


tumors  were  grouped  in  three  categories  (EphB2-ISClow,  EphB2-ISCmedlum, 
EphB2-ISChiSh),  as  described  in  Merlos-Suarez  et  al.19. 

Survival  analysis  and  other  statistical  tests.  Once  grouped  based  on  gene- 
expression  profiles,  patient  subsets  were  compared  for  survival  outcomes  using 
Kaplan-Meier  curves  and  multivariate  analysis  based  on  the  Cox  proportional 
hazards  method.  Differences  in  Kaplan-Meier  curves  were  tested  for  statistical 
significance  using  the  log-rank  test.  Enrichment  of  selected  pathological  or 
molecular  features,  such  as  high  pathological  grade  (G3-G4)  or  microsatellite 
instability  (MSI),  in  groups  characterized  by  immature  gene-expression 
patterns  (Group-3,  KRT20~/top-crypt~/low )  was  measured  using  odds-ratios 
and  tested  for  significance  using  Pearson’s  %2  test. 
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Human  diseases  have  been  investigated  in  the  context  of  single  genes  as  well  as  complex 
networks  of  genes.  Though  single  gene  approaches  have  been  extremely  successful  in 
the  past,  most  human  diseases  are  complex  and  better  characterized  by  multiple  interact¬ 
ing  genes  commonly  known  as  networks  or  pathways.  With  the  advent  of  high-throughput 
technologies,  a  recent  trend  has  been  to  apply  network-based  analysis  to  the  huge  amount 
of  biological  data.  Analysis  on  Boolean  implication  network  is  one  such  technique  that  dis¬ 
tinguishes  itself  based  on  its  simplicity  and  robustness.  Unlike  traditional  analyses,  Boolean 
implication  networks  have  the  power  to  break  into  the  mechanistic  insights  of  human  dis¬ 
eases.  A  Boolean  implication  network  is  a  collection  of  simple  Boolean  relationships  such 
as  "if  A  is  high  then  B  is  low."  So  far,  Boolean  implication  networks  have  been  employed 
not  only  to  discover  novel  markers  of  differentiation  in  both  normal  and  cancer  tissues, 
but  also  to  develop  robust  treatment  decisions  for  cancer  patients.  Therefore,  analyses 
based  on  Boolean  implication  networks  have  potential  to  accelerate  discoveries  in  human 
diseases,  suggest  therapeutics,  and  provide  robust  risk-adapted  clinical  strategies. 


Keywords:  bioinformatics,  cancer,  computational  biology,  differentiation,  microarray  analysis,  prognostic  biomark¬ 
ers,  stem  cell,  systems  biology 


INTRODUCTION 

In  the  past  detailed  single  gene  investigations  in  the  context  of 
human  diseases  was  extremely  successful  and  produced  many 
useful  drugs  (Miller  et  al.,  1982;  Slamon  et  al.,  2001;  Cunning¬ 
ham  et  al.,  2004;  Scott  et  al.,  2012).  However,  the  progress  was 
extremely  slow  and  the  success  was  achieved  at  the  cost  of  a  huge 
number  of  failed  investigations  with  multiple  billions  of  dollars 
in  investments  (Arrowsmith,  2011;  Allison,  2012).  Unlike  in  the 
past  years,  it  is  now  easy  to  gather  information  from  tens  of  thou¬ 
sands  of  genes  simultaneously.  Modern  approaches  can  leverage 
these  huge  amounts  of  biological  data  to  understand  human  dis¬ 
eases.  Therefore,  a  recent  trend  in  analysis  has  been  shifted  to 
multiple  genes  that  are  part  of  a  single  functional  unit  commonly 
known  as  networks  or  pathways.  The  new  approaches  have  been 
termed  network  analysis  or  systems  biology.  Clearly,  these  new 
approaches  have  the  potential  to  tackle  the  complexity  of  human 
diseases  (Mootha  et  al.,  2003;  Segal  et  al.,  2003;  Basso  et  al.,  2005; 
Subramanian  et  al.,  2005;  Margolin  et  al.,  2006;  Bonneau  et  al., 
2007;  Lee  et  al.,  2009;  Schadt  et  al.,  2010;  Bousquet  et  al.,  2011; 
Gupta  et  al.,  2011;  Jornsten  et  al.,  2011).  However,  the  systematic 
noise  in  the  system  has  always  challenged  these  approaches.  The 
noise  in  the  system  is  due  to  experimental  or  biological  noise  and 
also  noise  in  measuring  gene  expression  values  in  a  microarray 
hybridization  experiment.  In  addition  to  noise,  other  challenge  to 
the  network-based  approaches  is  to  translate  the  discoveries  to  the 
clinic. 

In  this  mini  review,  we  discuss  a  systems  biology  or  network- 
based  analysis  using  Boolean  implication  network  (Sahoo  et  al., 
2008).  A  Boolean  implication  network  is  simply  a  collection  of 
Boolean  implication  relationships  as  described  by  Sahoo  et  al. 
(2008).  Boolean  typically  means  a  logic  calculus  of  two  values, 


which  are  high  and  low  gene  expression  values  in  this  context.  A 
Boolean  implication  relationship  is  a  simple  “if-then”  relationship 
between  the  high  and  low  gene  expression  values  between  a  pair  of 
genes.  For  example,  “if  A  is  high,  then  B  is  high”  is  a  Boolean  impli¬ 
cation  relationship  between  a  pair  of  genes  A  and  B,  where  A  high 
and  B  low  is  ruled  out  as  a  possible  scenario  as  shown  in  Figure  1 . 
Therefore,  whenever  gene  expression  of  A  is  high,  we  observe  gene 
expression  of  B  is  also  high.  In  other  words,  A  high  is  a  subset  of 
B  high.  In  a  two  dimensional  scatter  plot  between  two  genes  and 
their  thresholds  for  high  and  low  values,  there  are  four  possible 
quadrants:  “A  low  B  low,”  “A  low  B  high,”  “A  high  B  low,”  and  “A 
high  B  high.”  One  or  more  sparse  quadrants  in  this  plot  is  math¬ 
ematically  represented  as  a  Boolean  implication.  For  example,  the 
Boolean  implication  “if  A  high,  then  B  high”  represent  a  sparse 
“A  high  B  low”  quadrant.  There  are  six  possible  Boolean  impli¬ 
cation  relationships,  two  of  them  are  symmetric,  and  other  four 
are  asymmetric.  The  symmetric  Boolean  implication  relationship 
has  two  diagonally  opposite  sparse  quadrant  and  the  asymmetric 
Boolean  implication  relationship  has  only  one  sparse  quadrant. 
As  shown  in  Figure  1,  the  threshold  to  define  “high”  and  “low” 
gene  expression  levels  are  determined  using  StepMiner  (Sahoo 
et  al.,  2007).  The  expression  levels  of  each  probeset  are  sorted  and 
a  step  function  fitted  (using  StepMiner)  to  the  sorted  expression 
level  that  minimizes  the  square  error  between  the  original  and 
the  fitted  values.  We  determined  the  noise  margin  by  using  very 
tightly  correlated  genes  and  found  that  there  is  still  a  difference  of 
twofold  change  (in  log  scale  a  value  of  Miller  et  al.,  1982)  among 
the  values  that  are  linearly  related.  Therefore,  we  used  a  noise 
margin  of  1  (threshold  —0.5  to  threshold  +0.5)  and  discarded  all 
the  microarrays  that  fall  within  these  region  for  Boolean  implica¬ 
tion  analysis.  The  noise  margin  was  an  important  consideration 
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FIGURE  1  |  Boolean  implication  in  gene  expression  database. 

Boolean  implication  is  a  pair-wise  gene  expression  relationship  between 
two  genes  with  respect  to  their  gene  expression  values.  (A)  Schematic 
example  of  a  Boolean  implication  between  two  genes  A  and  B.  Threshold 
to  separate  high  and  low  gene  expression  values  are  computed  using 
StepMiner.  A  noise  margin  of  0.5  is  used  for  statistical  calculations.  Each 
of  the  four  quadrant  is  tested  for  their  sparsity.  In  this  case,  A  high  and  B 


low  quadrant  is  sparse  representing  the  Boolean  implication  "if  A  high, 
then  B  high."  (B)  An  example  of  a  significant  Boolean  implication 
between  ESR1  and  CD9:  if  ESR1  high,  then  CD9  high.  Every  point  is  a 
microarray  experiment  performed  on  human  samples  on  Affymetrix 
platform.  There  are  46,045  microarrays  in  this  scatter  plot  all  of  which 
were  downloaded  from  NCBI's  Gene  Expression  Omnibus  (GEO) 
website. 


that  allowed  us  to  identify  many  significant  Boolean  implication 
relationships. 

SYSTEMS  BIOLOGY  USING  BOOLEAN  IMPLICATION 

It  is  possible  to  discover  Boolean  implication  relationships  in  the 
largest  possible  dataset  that  include  all  publicly  available  microar¬ 
rays  from  Gene  Expression  Omnibus  (GEO)  or  ArrayExpress. 
These  relationships  represent  natural  invariants  in  a  particular 
species.  For  example,  a  Boolean  implication  relationship  in  a  par¬ 
ticular  dataset  that  contains  all  human  samples  on  Affymetrix 
platform  represents  a  natural  invariant  gene  expression  relation¬ 
ship  in  human.  Many  of  these  invariants  are  due  to  tissue  specific 
gene  expression.  For  example,  a  brain  specific  gene  and  a  prostate 
specific  gene  can  never  be  expressed  together.  Therefore,  they  will 
have  a  Boolean  relationship  of  the  form  “if  A  high,  then  B  low.” 
Similarly,  many  of  these  relationships  can  be  due  to  developmental 
gene  expression  pattern  or  related  to  the  biological  process  of  dif¬ 
ferentiation.  Mining  developmentally  regulated  genes  (MiDReG) 
is  a  simple  algorithm  that  uses  Boolean  implication  to  identify 
genes  expressed  at  different  stages  of  differentiation  (Sahoo  et  al., 
2010).  The  key  concept  behind  this  algorithm  is  to  use  invariants 
to  predict  state  of  the  gene  expression  pattern.  We  describe  here 
how  MiDReG  and  Boolean  implication  are  used  in  B  cell,  bladder 
cancer,  and  colon  cancer  differentiation. 

B-CELL  DIFFERENTIATION 

B  cells  are  special  types  of  blood  cell  that  are  created  from  a 
blood  stem  cell  by  the  process  of  differentiation.  As  the  stem  cell 
undergoes  the  process  of  differentiation,  many  genes  changes  their 
expression  pattern.  There  are  genes  that  are  specific  to  the  stem 


cell  only  and  also  there  are  genes  that  are  specific  to  the  differ¬ 
entiated  B  cell.  MiDReG  algorithm  takes  advantage  of  these  gene 
pairs  that  have  a  significant  Boolean  implication  “if  A  high,  then 
B  low,”  and  predict  other  genes  that  are  expressed  in  the  prog¬ 
enitors  or  precursors  of  B  cells  (Inlay  et  al.,  2009;  Sahoo  et  al., 
2010).  Let’s  assume  that  gene  A  is  expressed  at  the  blood  stem 
cells  and  it  turns  off  as  the  stem  cells  differentiate  to  B  cell.  Sim¬ 
ilarly,  let’s  assume  that  gene  B  is  off  at  the  stem  cell  and  it  turns 
on  as  the  stem  cell  differentiates  to  B  cells  (Figure  2A).  There¬ 
fore,  in  this  narrow  view  of  differentiation  gene  A  and  gene  B  are 
mutually  exclusively  expressed.  Let’s  assume  that  there  is  a  signif¬ 
icant  Boolean  implication  “if  A  high,  then  B  low.”  The  significant 
Boolean  implication  represents  a  global  invariant  in  all  microarray 
datasets.  In  this  case,  if  we  want  to  identify  a  gene  X  that  turns  on 
after  gene  A  turns  off  and  before  gene  B  turns  on,  we  could  sim¬ 
ply  use  Boolean  implication  “if  A  high,  X  low,”  and  “if  B  high,  X 
high”  (Figure  2A).  Since  the  Boolean  implication  is  an  invariant, 
we  could  hypothesize  a  state  of  differentiation  where  gene  A  is  off, 
gene  X  is  on,  and  gene  B  is  off.  In  addition,  this  state  of  differ¬ 
entiation  is  between  stem  cell  and  the  mature  B  cell.  Therefore, 
gene  X  could  potentially  mark  precursors  of  the  mature  B  cell. 
We  validated  the  gene  expression  patterns  of  the  newly  discovered 
genes  using  this  approach  by  qPCR  on  the  sorted  B-cell  progeni¬ 
tors  from  mouse  blood  and  bone  marrow.  Review  of  the  published 
literature  of  knockout  mice  revealed  that  many  of  our  discovered 
genes  were  directly  involved  in  B-cell  differentiation.  Out  of  62 
MiDReG  genes,  41  genes  were  found  to  be  knocked  out  in  mice. 
Out  of  these  41  mice  knockouts,  26  (63.4%)  genes  show  defects  in 
B-cell  function  and  differentiation,  9  (22.0%)  genes  are  associated 
with  known  B-cell  function  according  to  other  experiments,  and  6 
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B  MiDReG  in  bladder  cancer  differentiation 


C  MiDReG  in  colorectal  cancer  differentiation 


FIGURE  2  |  Discovery  of  markers  of  differentiation  using  MiDReG 
algorithm.  Mining  developmentally  regulated  genes  (MiDReG)  is  an 
algorithm  that  uses  Boolean  implication  to  predict  specific  markers  of 
differentiation  in  normal  and  cancer  tissues.  (A)  MiDReG  algorithm  is  used 
to  predict  markers  of  B-cell  differentiation.  (B)  MiDReG  algorithm  is  used  to 
predict  markers  of  bladder  cancer  differentiation.  (C)  MiDReG  algorithm  is 
used  to  predict  markers  of  colorectal  cancer  differentiation. 


(14.6%)  genes  could  have  a  B-cell  function  based  on  their  expres¬ 
sion  in  the  B  cell  and  reported  other  hematopoietic  functions. 
A  detailed  analysis  on  mouse  lineages  using  MiDReG  revealed  a 
new  earliest  marker  of  B-cell  differentiation  Ly6D.  This  gene  was 
investigated  in  detail  by  Inlay  et  al.  (2009).  Overall,  our  results 
on  the  B-cell  differentiation  suggested  that  MiDReG  is  a  simple 
but  extremely  powerful  approach  to  discover  novel  markers  of 
progenitor  cells. 

BLADDER  CANCER  DIFFERENTIATION 

Differentiation  within  cancer  is  a  very  controversial  topic  (Reya 
et  al.,  2001).  However,  in  bladder  cancer  it  is  established  that  there 
are  two  different  cell  types  identified  by  Keratin  5  and  Keratin 
20  (Chan  et  al.,  2009).  Keratin  5  marks  immature  cell  types  that 
can  differentiate  to  Keratin  20  positive  cells  (Chan  et  al.,  2009). 
MiDReG  algorithm  was  used  to  identify  an  upstream  marker  Ker¬ 
atin  14  (Volkmer  et  al.,  2012).  There  is  a  significant  Boolean 
implication  relationship  between  Keratin  5  and  Keratin  20  “if 
Keratin  5  high,  then  Keratin  20  low”  that  enabled  the  MiDReG 


algorithm  to  predict  upstream  markers.  In  this  case,  we  are  inter¬ 
ested  in  a  marker  X  that  goes  down  early  compared  to  Keratin 
5.  Thus,  it  is  expressed  at  the  most  immature  state  of  the  can¬ 
cer  cell.  The  candidate  markers  were  chosen  based  on  Boolean 
implication  “if  X  high,  then  Keratin  5  high”  and  “if  X  high,  then 
Keratin  20  low”  (Figure  2B).  Keratin  14  was  one  of  the  markers 
that  satisfied  these  two  Boolean  implication  strongly.  In  addition, 
Keratin  14  was  a  single  prognostic  marker  in  both  gene  and  pro¬ 
tein  expression  datasets.  The  prognostic  power  of  Keratin  14  was 
independent  of  currently  established  stage  and  grade.  Therefore,  a 
simple  immunohistochemical  analysis  can  identify  high  risk  blad¬ 
der  cancer  patients.  Since,  clinicians  decide  whether  to  perform 
cystectomy  which  is  complete  bladder  removal  based  on  stage  and 
grade,  it  is  possible  to  incorporate  Keratin  14  based  risk  stratifi¬ 
cation  into  this  important  clinical  decision  endpoint.  Clinicians 
are  currently  developing  risk- adapted  clinical  strategies  based  on 
Keratin  14  for  bladder  cancer  patients. 

COLON  CANCER  DIFFERENTIATION 

Many  important  markers  in  the  differentiation  of  colon  cancer 
cells  follow  Boolean  implication  (Dalerba  et  al.,  2011).  For  exam¬ 
ple,  there  is  a  significant  Boolean  implication  between  Keratin 
20  and  CA1  “if  CA1  high,  then  Keratin  20  high”  (Figure  2C). 
This  relationship  is  particularly  strong  with  no  exception.  There 
are  no  tumors  with  CA1  high  and  KRT20  low.  Even  in  a  tumor 
when  CA1  positive  cells  are  present  they  have  to  go  through 
a  KRT20  positive  precursor  cell  during  differentiation.  Accord¬ 
ingly,  CA1  positive  cells  are  a  subset  of  Keratin  20  positive  cells  in 
both  normal  colon  and  colorectal  cancer  tissues.  In  addition,  Ker¬ 
atin  20  negative  patients  have  worse  outcome  compared  to  CA1 
positive  and  Keratin  20  positive  cancer  patients.  Other  markers 
such  as  MS4A12,  CD  177,  and  SLC26A3  follow  similar  Boolean 
implication  relationships. 

STRENGTHS  AND  LIMITATIONS 

In  this  review  we  show  that  Boolean  implication  can  be  used  to 
identify  markers  of  differentiation  in  both  normal  and  cancer  tis¬ 
sues.  The  strength  of  Boolean  implication  is  its  ability  to  identify 
asymmetric  gene  expression  relationships.  In  contrast,  most  other 
approaches  focus  on  using  symmetric  gene  expression  relationship 
to  build  gene  expression  network.  We  have  shown  that  some  of  the 
gene  expression  patterns  in  differentiation  can  be  modeled  using 
asymmetric  Boolean  implication.  Therefore,  it  would  be  useful  for 
predicting  important  genes  involved  in  the  process  of  differentia¬ 
tion.  In  addition,  markers  of  differentiation  are  most  likely  robust 
prognostic  biomarkers  in  cancer  patients.  Using  these  markers, 
clinicians  may  be  able  to  develop  better  risk- adapted  treatment 
decisions  for  cancer  patients.  The  limitation  of  Boolean  implica¬ 
tion  is  that  it  requires  large  number  of  samples.  Also,  it  might 
miss  many  other  important  genes  that  are  involved  in  differentia¬ 
tion  but  do  not  have  significant  Boolean  implication.  Accordingly, 
Boolean  implication  is  a  very  stringent  criterion.  Therefore,  it  pulls 
out  many  important  genes  and  appears  to  be  less  noisy  compared 
to  traditional  approaches. 

An  important  distinction  between  Boolean  implication  analy¬ 
ses  compared  to  other  traditional  network-based  analyses  is  that 
most  of  these  other  analyses  are  focused  on  identifying  gene 
regulatory  networks  or  signal  transduction  pathways.  Boolean 
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implication  has  not  been  utilized  to  identify  gene  regulatory  net¬ 
works  or  signaling  networks  which  contains  simple  feed-back  and 
feed-forward  structure.  Instead,  it  was  used  to  identify  cell  type  or 
tissue  specific  gene  expression  patterns  and  they  are  interpreted 
in  terms  of  development  and  differentiation.  This  is  very  differ¬ 
ent  from  Bayesian  or  mutual  information  based  networks  that 
primarily  identify  transcription  factors  and  their  targets  (Segal 
et  al.,  2003;  Basso  et  al.,  2005;  Margolin  et  al.,  2006;  Lee  et  ah, 
2009).  Similarly,  Boolean  implication  analyses  are  also  different 
from  traditional  Boolean  networks  that  are  used  to  build  a  func¬ 
tional  executable  model  or  a  circuit  model  (Glass  and  Kauffman, 
1973;  Shmulevich  and  Kauffman,  2004).  There  are  also  networks 
based  on  ODE  models  which  describes  mechanistic  biochemi¬ 
cal  interactions  (Ferrell  et  ah,  2011).  Both  the  Boolean  and  ODE 
based  approaches  described  above  models  non-linear  dynami¬ 
cal  systems  (Glass  and  Kauffman,  1973;  Shmulevich  and  Kauff¬ 
man,  2004;  Ferrell  et  ah,  2011).  In  contrast,  Boolean  implication 
analyses  models  static  invariant  relationships  in  a  large  biological 
dataset. 

In  summary,  Boolean  implication  is  an  empirically  observed 
relationship  in  the  data,  which  may  not  hold  for  data  gathered 
for  different  tissue  types  or  under  different  conditions.  Like  cor¬ 
relation  networks,  Boolean  implication  networks  do  not  capture 


causality.  Boolean  implication  captures  both  symmetric  as  well 
as  asymmetric  relationships.  It  provides  a  powerful  platform  for 
discovery  of  novel  markers  of  differentiation  in  both  normal  and 
cancer  tissues. 
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Supplmentary  Figure  1:  List  of  prostate  cancer  datasets 


A.  Prostate  Cancer  datasets 
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Pubmed 

RAW 
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yes 

NA 
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Glinsky  GV 

J  Clin  Invest. 
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yes 

NA 

U133A2 

yes 

79 

Lapointe  J 

PNAS 
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yes 

GSE3933 

cDNA 

yes 
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Zuls 

Genome  Res 

2010 

21521786 

yes 

NA 

HEEBO 

yes 
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BMC  Cancer 

2007 
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HG_U95Av2  no 
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Pressinotti  NC  Mol  Cancer 

2010 

20035634 

GS  El  5484 

GPL3050 

no 

65 

Sboner A 

BMC  Med  Genomics 

2010 

20233430 

yes 

GS  El  6560 

GPL5474 

yes 

281 

Wang  Y 

Cancer  Res 

2009 

20663908 

yes 

GS  El  7951 

U133Plus2 

no 
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Taylor  BS 

Cancer  Cell 

2010 

20579941 

yes 
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HuEx-1_0-st 

yes 
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Setlur 

J  Natl  Cancer  Inst 
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yes 
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no 
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B.  Global  Prostate  Cancer  Database 
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Survival  #  patients 
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Environ  Health  Perspect 
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18560533 

yes 
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19 

Berry  PA 

Prostate 

2011 
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yes 
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14 

Best  CJ 

Clin  Cancer  Res 
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yes 
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20 

Birnie  R 

Genome  Biol. 

2008 

18492237 

yes 

E-MEXP-993 

GPL570 

36 

Chambers  KF 

J  Biomed  Sci 

2011 

21696611 

yes 

E-MEXP-2034GPL570 

40 

Guyon  1 

2011 

yes 

E-TABM-456 

GPL96 

85 

Liu  P 

Cancer  Res 

2006 

16618720 

yes 

E-TAB  M-26 

GPL96 

57 

Sun  Y 

Prostate 

2009 

19343730 

yes 

GSE25136 

GPL96 

79 

Traka  M 

PLoS  One. 

2008 

18596959 

yes 

E-MEXP-1243GPL570 

81 

Tsavachidou  D  J  Natl  Cancer  Inst. 

2009 

19244175 

yes 

E-  MEXP- 1 327  GP  L96 

85 

Varambally  S 

Cancer  Cell 

2005 

16286247 

yes 

GSE3325 

GPL570 

19 

Wallace  TA 

Cancer  Res 

2008 

18245496 

yes 

GSE6956 

GPL571 

72 

Wang  Y 

Cancer  Res 

2010 

20663908 

yes 

GSE8218 

GPL96 

130 

Wang  Y 

Cancer  Res 

2010 

20663908 

yes 

GSE17951 

GPL570 

154 

Total  891 


Supplementary  Figure  1 :  List  of  prostate  cancer  datasets.  Panel  A  shows  a  list  of  pub¬ 
licly  available  prostate  cancer  datasets  with  clinical  information  (Only  five  dataset  with 
survival  outcome).  Panel  B  shows  a  list  of  prostate  cancer  datasets  on  Affymetrix 
U133A  (GPL96),  U133A  2.0  (GPL571)  or  U133  Plus  2.0  (GPL570)  microarray  plat¬ 
forms  that  are  normalized  together  to  build  a  large  global  prostate  cancer  database. 
The  lists  include  the  first  author,  journal  where  it  was  first  published,  year  in  which  it 
was  published,  the  PubMed  id,  GEO/Array Express  id,  microarray  platforms,  survival 
annotation,  and  number  of  patients. 


Supplementary  Figure  2:  Infering  developmental  gene  regulation  from 
Boolean  implication  relationship 


A  If  K1 4  high  then  K5  high  Q  If  K14  high  then  K5  high 


Supplementary  Figure  2:  Infering  developmental  gene  regulation  from  Boolean 
implication  relationship.  To  infer  developmental  gene  regulation  (A)  we  use  Bool¬ 
ean  implication  (B).  In  most  human  epithelial  tissues  both  Keratin  5  (K5)  and 
Keratin  14  (K14)  are  expressed  in  the  basal  cell  compartments.  We  analyzed 
gene  expression  values  of  K14  and  K5,  that  is  presented  in  the  form  of  a  scatter- 
plot  with  25,237  points  representing  diverse  microarrays  on  human  samples 
including  different  normal  and  cancer  tissues.  We  summarize  the  gene  expres¬ 
sion  relationship  between  K14  and  K5  as  “if  K14  high  then  K5  high”  or  alterna¬ 
tively  a  Boolean  implication  relationship  “K14  high  =>  K5  high”.  The  relationship 
clearly  suggests  that  K14+  arrays  are  a  subset  of  K5+  arrays.  Since  not  all  cells 
within  a  sample  express  K14  and  K5,  we  could  hypothesize  that  K14+  cells  are 
a  subset  of  K5+  cells  (A)  based  on  the  Boolean  implication.  Panel  A  shows  a 
likely  model  of  developmental  gene  regulation  between  K14  and  K5,  where  K14 
is  upstream  of  K5. 


Supplementary  Figure  3:  Relationship  between  Keratin  gene  expression 
and  clinical  outcome 


A.  Scatter  Plot 


B.  Survival  Analysis 
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Supplementary  Figure  3:  Relationship  between  Keratin  gene  expression  and  clinical  outcome.  To  evaluate 
whether  Keratin  gene  expression  is  associated  with  patient  outcome,  we  investigated  the  status  of  three  Keratin 
expression  groups  (KRT14+KRT5+,  KRT14-KRT5+,  KRT14-KRT5-)  on  recurrence-free  survival  (RFS)  in  three 
independent  prostate  cancer  cohorts  (Singh  2002  dataset,  n=102;  Glinsky  2004  dataset,  n=78;  Taylor  2010  data¬ 
set,  n=185),  The  results  confirmed  that  KRT14-KRT5-  tumors  were  associated  with  worse  clinical  outcomes  (B). 
In  addition,  KRT14+KRT5+  tumors  were  associated  with  best  clinical  and  KRT14-KRT5+  tumors  were  associated 
with  intermediate  clinical  outcome. 


